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Preface to the Second Edition 


Interest in and the resulting impact of Bayes and empirical Bayes (EB) 
methods continues to grow. An impressive expansion in the number of 
Bayesian journal articles, conference presentations, courses, seminars, and 
software packages has occurred in the four years since our first edition went 
to press. Perhaps more importantly. Bayesian methods now find applica¬ 
tion in a large and expanding number of areas where just a short time 
ago their implementation and routine use would have seemed a laudable 
yet absurdly unrealistic goal. The allocation of billions of U.S. federal dol¬ 
lars now depends on Bayesian estimates of population characteristics (e.g. 
poverty) in small geographic areas. The FDA now not only permits but en¬ 
courages Bayesian designs and analyses in applications seeking approval for 
new medical devices such as pacemakers, an area where large clinical trials 
are infeasible but directly relevant historical information is often available. 
The Microsoft Windows Office Assistant is based on a Bayesian artificial 
intelligence algorithm; similarly, a prominent movie-rental chain now offers 
a web-based Bayesian expert system to help you choose which film to rent, 
based on your general preferences and the names of other films you have en¬ 
joyed. No longer merely of academic interest, Bayes and EB methods have 
assumed their place in the toolkit of the well-equipped applied statistician. 

A principal reason for the ongoing expansion in the Bayes and EB statis¬ 
tical presence is of course the corresponding expansion in readily-available 
computing power, and the simultaneous development in Markov chain Mon¬ 
te Carlo (MCMC) methods and software for harnessing it. Our desire to 
incorporate many recent developments in this area provided one main im¬ 
petus for this second edition. For example, we now include discussions 
of reversible jump MCMC, slice sampling, structured MCMC, and other 
new computing methods and software packages. Feedback from readers and 
practitioners also helped us identify important application areas given little 
or no attention in the first edition. As such, we have expanded our cover¬ 
age of Bayesian methods in spatial statistics, sequential analysis, model 
choice, and experimental design, especially sample size estimation for clin¬ 
ical trials. This edition also features an updated reference list, expanded 
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and improved subject and author indices, and corrections of many (though 
surely not all) typographical errors. 

Still, perhaps our main motivation in preparing this second edition was 
to improve the book's effectiveness as a course text. The most significant 
change in this regard is the moving of the decision-theoretic material from 
Chapter 1 to a location (new Appendix B) specifically devoted to this topic. 
Chapter 1 now includes several vignettes and elementary examples which 
illustrate Bayesian thinking and benefits. Our intent is to provide a gentler 
and more direct introduction to the "nuts and bolts" issues of Bayesian 
analysis in Chapter 2. 

Besides these structural changes, we have attempted to enhance the 
book's teaching value in other ways. These include a gentler, better in¬ 
troduction to Gibbs sampling in Subsection 5.4.2, an explicit description 
of how to estimate MCMC standard errors in Subsection 5.4.5, a greatly 
revised treatment of Bayesian model choice in Sections 6.3, 6.4, and 6.5, 
actual BUGS and WinBUGS code to do several of the examples in the book, 
and the addition of several new homework problems and solutions. 

As with the first edition, this revision presupposes no previous exposure 
to Bayes or EB methods, but only a solid. Master's-level understanding of 
mathematical statistics - say, at the level of Hogg and Craig (1978), Mood, 
Graybill, and Boes (1974), or Casella and Berger (1990). A course on the 
basics of modern applied Bayesian methods might cover only Chapters 1, 
2, 5, and 6, with supplemental material from Chapters 3 (EB), 7 (special 
topics), and 8 (case studies) as time permitted. A more advanced course 
might instead refer to Appendix B after Chapter 2, then return to the 
EB and performance material in Chapters 3 and 4, since this material is 
directly motivated by decision theory. The advanced computing, methods, 
and data analysis material in Chapters 5 and beyond then follows naturally. 
In our course in the Division of Biostatistics at the University of Minnesota 
we cover the entire book in a four-credit-hour, single-semester (15 week) 
course; the aforementioned "short course" could surely be completed in far 
less time. See http://www.biostat.umn.edu/~brad/ on the web for many 
of our datasets and other teaching-related information. 

We owe a debt of gratitude to those who helped in our revision process. 
Sid Chib, Mark Glickman, Betz Halloran, Siem Heisterkamp, David Hig¬ 
don, Jim Hodges, Giovanni Parmigiani, Jeff Rosenthal, David Spiegelhalter, 
Melanie Wall, and Larry Wasserman all provided valuable feedback, spot¬ 
ting errors and suggesting directions our revision might take. Cong Han and 
Li Zhu provided invaluable technical support. We are especially grateful to 
Antonietta Mira for explaining slice sampling to us. Finally, we thank our 
families, whose ongoing love and support made all of this possible. 

Bradley P. Carlin Minneapolis, Minnesota 

Thomas A. Louis April 2000 
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Preface to the First Edition 


Bayes and empirical Bayes methods have been shown to have attractive 
properties for both the Bayesian and the frequentist. However, a combi¬ 
nation of philosophical posturing and difficulties in implementation have 
worked against these approaches becoming standard data-analytic tools. 
Our intent in this book is to help redress this situation by laying out the 
formal basis for Bayes and empirical Bayes (EB) approaches, evaluating 
their Bayes and frequentist properties, and showing how they can be im¬ 
plemented in complicated model settings. Our hope is that readers will 
better understand the benefits (and limitations) of Bayes and EB, and will 
increasingly use these approaches in their own work. 

Though there are several books on the philosophical and mathematical 
aspects of Bayes and EB methods, there are none we know of that strongly 
link modern developments to applications while also showing the several 
practical benefits of these approaches irrespective of one's philosophical 
viewpoint. We believe that progress in computing and the attendant mod¬ 
eling flexibility sets the stage for Bayes and empirical Bayes methods to be 
of substantial benefit for the applied statistician. 

We intend that this book will help realize this benefit by avoiding exces¬ 
sive philosophical nitpicking, and instead show how Bayes and EB proce¬ 
dures operate in both theory and practice. The book will satisfy the growing 
need for a ready reference to Bayes and EB that can be read and appre¬ 
ciated by the practicing statistician as well as the second-year graduate 
student, and a text that casts the formal development in terms of applied 
linkages. The book is intended for those with a good Master's-level under¬ 
standing of mathematical statistics, but we assume no formal exposure to 
Bayes or empirical Bayes methods. It should be appropriate for use both as 
a text and as a reference. Our somewhat informal development is designed 
to enhance understanding; mathematical proofs are kept to a minimum. 
We hope that the mathematical level and breadth of examples will recruit 
students and teachers not only from statistics and biostatistics, but from 
a broad range of fields, including for example epidemiology, econometrics, 
education, psychology, and sociology. 

Bayes and EB touches much of modern statistical practice, and we cannot 
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include all relevant theory and examples. For example, the theory of diag¬ 
nostic tests is best presented in a Bayesian fashion, and Bayesian formalism 
lies behind all random effects and errors-in-variables models. We have dealt 
with this breadth by including models and examples that effectively com¬ 
municate the approach, but make no claim to being comprehensive. For 
example, this book does little more than mention nonparametric Bayesian 
approaches, such as the use of Dirichlet process priors, even though they 
are a subject of greatly increased interest of late thanks to the addition of 
Markov chain Monte Carlo (MCMC) methods to the Bayesian toolkit. 

Philosophically, two principal themes dominate the book: that effective 
statistical procedures strike a balance between variance and bias, and that 
Bayesian formalism in a properly robust form produces procedures that 
achieve this balance. This success applies whether one evaluates methods 
as a Bayesian (considering parameters as random variables and the data 
as fixed), or as a frequentist (over a range of fixed parameter values). With 
this in mind, we begin in Chapter 1 by outlining the decision-theoretic tools 
needed to compare procedures, though unlike more traditional Bayesian 
texts (e.g., DeGroot, 1970; Berger, 1985; Robert, 1994) we do not dwell 
on them. After presenting the basics of the Bayes and EB approaches in 
Chapters 2 and 3, respectively, Chapter 4 evaluates the frequentist and 
empirical Bayes (preposterior) performance of these approaches in a variety 
of settings. Of course, no single approach can be universally best, and we 
identify both virtues and drawbacks. This sharpens our focus on generating 
procedures that do not rely heavily on prior information (i.e., that enjoy 
good frequentist and EB properties over a broad range of parameter values). 

We intend for the book to have a very practical focus, offering real solu¬ 
tion methods to researchers with challenging problems. This is especially 
true of the book's second half, which begins with an extensive discussion 
of modern Bayesian computational methods in Chapter 5. These methods 
figure prominently in modern tools for such data analytic tasks as model 
criticism and selection, which are described in Chapter 6. Guidance on the 
Bayes/EB implementation of a generous collection of special methods and 
models is given in Chapter 7. Finally, we close in Chapter 8 with three fully 
worked case studies of real datasets. These studies incorporate tools from 
a variety of statistical subfields, including longitudinal analysis, survival 
models, sequential analysis, and spatio-temporal mapping. 

The book concludes with three useful appendices. Appendix A contains 
a brief summary of the distributions used in the book, highlighting the 
typical Bayes/EB role of each. For those wishing to avoid as much computer 
programming as possible. Appendix B provides a guide to the commercial 
and noncommercial software available for performing Bayesian analyses, 
indicating the level of model complexity each is capable of handling. While 
guides such as ours become outdated with remarkable rapidity, we believe 
it to be fairly complete as of the time of our writing, and hope it will at 
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least serve as a starting point for the beginning applied Bayesian. Finally, 
Appendix C contains solutions to several of the exercises in each of the 
book's chapters. 

We encourage the student to actually try the exercises at the end of each 
chapter, since only through practice will the student gain a true appreci¬ 
ation for the (often computationally intensive) methods we advocate. We 
caution that each chapter's exercises are not necessarily arranged in order 
of increasing difficulty, but rather in roughly the same order as the topics 
in that chapter, oftentimes with the more theoretical and methodological 
questions preceding the more computational and data-analytic ones. 

We are very grateful to a variety of persons who helped make this book 
possible. Deb Sampson typed a significant portion of the first draft, and 
also proved invaluable in her role as "LATgX goddess." Wei Shen, Hong 
Xia, and Dandan Zheng spent hours writing computer simulation code and 
drawing many of the S+ plots seen throughout the book. The entire 1995 
and 1996 winter quarter "Bayes and Empirical Bayes Methods" classes 
served as aggresive copy-editors, finding flawed homework problems, miss¬ 
ing references, and an embarrassing number of misconstructed sentences. 
Finally, we thank Caroline S. Carlin. F.S.A. and Karen Seashore Louis, 
Ph.D. for their unending patience and good humor. So scarred are they 
from the experience that both now use words like "posterior sensitivity" in 
polite conversation. 

Bradley P. Carlin Minneapolis, Minnesota 

Thomas A. LOUIS March 1996 


© 2000 by CRC Press LLC 



CHAPTER 1 


Approaches for statistical inference 


1.1 Introduction 


The practicing statistician faces a variety of challenges: designing complex 
studies, summarizing complex data sets, fitting probability models, draw¬ 
ing conclusions about the present, and making predictions for the future. 
Statistical studies play an important role in scientific discovery, in policy 
formulation, and in business decisions. Applications of statistics are ubiq¬ 
uitous, and include clinical decision making, conducting an environmental 
risk assessment, setting insurance rates, deciding whether (and how) to 
market a new product, and allocating federal funds. Currently, most statis¬ 
tical analyses are performed with the help of commercial software packages, 
most of which use methods based on a classical, or frequentist, statistical 
philosophy. In this framework, maximum likelihood estimates (MLEs) and 
hypothesis tests based on /;-values figure prominently. 

Against this background, the Bayesian approach to statistical design and 
analysis is emerging as an increasingly effective and practical alternative 
to the frequentist one. Indeed, due to computing advances which enable 
relevant Bayesian designs and analyses, the philosophical battles between 
frequentists and Bayesians that were once common at professional statis¬ 
tical meetings are being replaced by a single, more eclectic approach. The 
title of our book makes clear which philosophy and approach we prefer; 
that you are reading it suggests at least some favorable disposition (or at 
the very least, curiosity) on your part as well. Rather than launch head¬ 
long into another spirited promotional campaign for Bayesian methods, 
we begin with motivating vignettes, then provide a basic introduction to 
the Bayesian formalism followed by a brief historical account of Bayes and 
frequentist approaches, with attention to some controversies. These lead 
directly to our methodological and applied investigations. 
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1.2 Motivating vignettes 

1.2.1 Personal probability 

Suppose you have submitted your first manuscript to a journal and have 
assessed the chances of its being accepted for publication. This assessment 
uses information on the journal's acceptance rate for manuscripts like yours 
(let's say around 30%), and your evaluation of the manuscript's quality. 
Subsequently, you are informed that the manuscript has been accepted 
(congratulations!). What is your updated assessment of the probability 
that your next submission (on a similar topic) will be accepted? 

The direct estimate is of course 100% (thus far, you have had one success 
in one attempt), but this estimate seems naive given what we know about 
the journal's overall acceptance rate (our external, or prior, information in 
this setting). You might thus pick a number smaller than 100%; if so, you 
are behaving as a Bayesian would because you are adjusting the (unbiased, 
but weak) direct estimate in the light of your prior information. This ability 
to formally incorporate prior information into an analysis is a hallmark of 
Bayesian methods, and one that frees the analyst from ad hoc adjustments 
of results that "don't look right." 


1.2.2 Missing data 

Consider Table 1.1, reporting an array of stable event prevalence or inci¬ 
dence estimates scaled per 10,000 population, with one value (indicated by 
"*") missing at random. The reader may think of them as geographically 
aligned disease prevalences, or perhaps as death rates cross-tabulated by 
clinic and age group. 


79 

87 

83 

80 

78 

90 

89 

92 

99 

95 

96 

100 

* 

110 

115 

101 

109 

105 

108 

112 

96 

104 

92 

101 

96 


Table 1.1 An army of well-estimated rates per 10,000 with one estimate missing 

With no direct information for *, what would you use for an estimate? 
Does 200 seem reasonable? Probably not, since the unknown rate is sur¬ 
rounded by estimates near 100. To produce an estimate for the missing 
cell you might fit an additive model (rows and columns) and then use the 
model to impute a value for *, or merely average the values in surround¬ 
ing cells. These are two examples of borrowing information. Whatever your 
approach, some number around 100 seems reasonable. 
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Now assume that we obtain data for the * cell and the estimate is, in 
fact, 200, based on 2 events in a population of 100 (200 = 10000 x 2/100). 
Would you now estimate * by 200 (a very unstable estimate based on very 
little information), when with no information a moment ago you used 100? 
While 200 is a perfectly valid estimate (though its uncertainty should be 
reported), some sort of weighted average of this direct estimate (200) and 
the indirect estimate you used when there was no direct information (100) 
seems intuitively more appealing. The Bayesian formalism allows just this 
sort of natural compromise estimate to emerge. 

Finally, repeat this mental exercise assuming that the direct estimate is 
still 200 per 10,000, but now based on 20 events in a population of 1000, 
and then on 2000 events in a population of 100,000. What estimate would 
you use in each case? Bayes and empirical Bayes methods structure this 
type of statistical decision problem, automatically giving increasing weight 
to the direct estimate as it becomes more reliable. 


1.2.3 Bioassay 

Consider a carcinogen bioassay where you are comparing a control group 
(C) and an exposed group (E) with 50 rodents in each (see Table 1.2). In 
the control group, 0 tumors are found; in the exposed group, there are 3, 
producing a non-significant, one-sided Fisher exact test p-value of approx¬ 
imately 0.125. Flowever, your colleague, who is a veterinary pathologist, 
states, "I don't know about statistical significance, but three tumors in 50 
rodents is certainly biologically significant!" 

C E Total 

Tumor 0 3 3 

No Tumor 50 47 97 

50 50 100 

Table 1.2 Hypothetical bioassay results; one-sided p = 0.125 

This belief may be based on information from other experiments in the 
same lab in the previous year in which the tumor has never shown up in 
control rodents. For example, if there were 400 historical controls in ad¬ 
dition to the 50 concurrent controls, none with a tumor, the one sided 
p-value becomes 0.001 (see Table 1.3). Statistical and biological signifi¬ 
cance are now compatible. Still, it is generally inappropriate simply to 
pool historical and concurrent information. However, Bayes and empirical 
Bayes methods may be used to structure a valid synthesis; see, for example, 
Tarone (1982), Dempster et al. (1983), Tamura and Young (1986), Louis 
and Bailey (1990), and Chen et al. (1999). 
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C E Total 


Tumor 0 3 3 

No Tumor 450 47 497 

450 50 500 


Table 1.3 Hypothetical bioassay results augmented by 400 historical controls, none 
with a tumor, one-sided p = 0.001. 


1.2.4 Attenuation adjustment 

In a standard errors-in-variables simple linear regression model, the least 

squares estimate of the regression slope [0) is biased toward 0, an example 

of attenuation . More formally suppose the true regression is y = 0x + e, 

e ~ 3V(0, c 2 ) but Y is regressed not on x but on A" = x + 6, where 

b ~ .V(0, <r|). Then the least squares estimate has expectation E\0\ m p0, 
2 

with p = 0 € , < 1. If p is known or well-estimated, one can correct for 

attenuation and produce an unbiased estimate by using 0 / p to estimate 0. 

Though unbiasedness is an attractive property, especially when the stan¬ 
dard error associated with the estimate is small, in general it is less im¬ 
portant than having the estimate "close" to the true value. The expected 
squared deviation between the true value and the estimate ( mean squared 
error, or MSE) provides an effective measure of proximity. Fortunately for 
our intuition, MSE can be written as the sum of an estimator's sampling 
variance and its squared bias, 

MSE = variance + (bias) 2 . (1-1) 

The unbiased estimate sets the second term to 0, but it can have a verv 
large MSE relative to other estimators; in this case, because dividing 0 by 
p inflates the variance as the price of eliminating bias. 

In the exercises for Appendix B, we ask that you construct a Bayesian 
estimator that strikes an effective tradeoff between variance and bias. It 
goes part way toward complete elimination of bias, with the degree of 
adjustment depending on the relative size of the variance and the bias. 

1.3 Defining the approaches 

Three principal approaches to inference guide modern data analysis: fre- 
quentist, Bayesian, and likelihood. We now briefly describe each in turn. 

The frequentist evaluates procedures based on imagining repeated sam¬ 
pling from a particular model (the likelihood) which defines the probabil¬ 
ity distribution of the observed data conditional on unknown parameters. 
Properties of the procedure are evaluated in this repeated sampling frame- 
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work for fixed values of unknown parameters; good procedures perform well 
over a broad range of parameter values. 

The Bayesian requires a sampling model and, in addition, a prior dis¬ 
tribution on all unknown quantities in the model (parameters and missing 
data). The prior and likelihood are used to compute the conditional distri¬ 
bution of the unknowns given the observed data (the posterior distribution), 
from which all statistical inferences arise. Allowing the observed data to 
play some role in determining the prior distribution produces the empirical 
Bayes (EB) approach. The Bayesian evaluates procedures for a repeated 
sampling experiment of unknowns drawn from the posterior distribution 
for a given data set. The empirical Bayesian may also evaluate procedures 
under repeated sampling of both the data and the unknowns from their 
joint distribution. 

Finally, the "likelihoodist" (or "Fisherian "(develops a sampling model 
but not a prior, as does the frequentist. However, inferences are restricted 
to procedures that use the data only as reported by the likelihood, as a 
Bayesian would. Procedure evaluations can be from a frequentist, Bayesian, 
or EB point of view. 

As presented in Appendix B, the frequentist and Bayesian approaches 
can be connected in a decision-theoretic framework, wherein one considers 
a model where the unknown parameters and observed data have a joint 
distribution. The frequentist conditions on parameters and replicates (in¬ 
tegrates) over the data; the Bayesian conditions on the data and replicates 
(integrates) over the parameters. Preposterior evaluations integrate over 
both parameters and data, and can be represented as frequentist perfor¬ 
mance averaged over the prior distribution, or as Bayesian posterior per¬ 
formance averaged over the marginal sampling distribution of the observed 
data (the sampling distribution conditional on parameters, averaged over 
the prior). Preposterior properties are relevant to both Bayesians and fre- 
quentists (see Rubin, 1984). 

Historically, frequentists have criticized Bayesian procedures for their in¬ 
ability to deal with all but the most basic examples, for overreliance on 
computationally convenient priors, and for being too fragile in their depen¬ 
dence on a specific prior (i.e., for a lack of robustness in settings where the 
data and prior conflict). Bayesians have criticized frequentists for failure to 
incorporate relevant prior information, for inefficiency, and for incoherence 
(i.e., a failure to process available information systematically, as a Bayesian 
approach would). Recent computing advances have all but eliminated con¬ 
straints on priors and models, but leave open the more fundamental critique 
of non-robustness. In this book, therefore, we shall often seek the middle 
ground: Bayes and EB procedures that offer many of the Bayes advantages, 
but do so without giving up too much frequentist robustness. Procedures 
occupying this middle ground can be thought of as having good perfor¬ 
mance over a broad range of prior distributions, but not so broad as to 
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include all priors (which would in turn require good performance over all 
possible parameter values). 

1.4 The Bayes-frequentist controversy 

While probability has been the subject of study for hundreds of years (most 
notably by mathematicians retained by rich noblemen to advise them on 
how to maximize their winnings in games of chance), statistics is a relatively 
young field. Linear regression first appeared in the work of Francis Galton 
in the late 1800s, with Karl Pearson adding correlation and goodness-of-fit 
measures around the turn of the century. The field did not really blos¬ 
som until the 1920s and 1930s, when R.A. Fisher developed the notion 
of likelihood for general estimation, and Jerzy Neyman and Egon Pearson 
developed the basis for classical hypothesis testing. A flurry of research 
activity was energized by the Second World War, which generated a wide 
variety of difficult applied problems and the first substantive government 
funding for their solution in the United States and Great Britain. 

By contrast, Bayesian methods are much older, dating to the original 
1763 paper by the Rev. Thomas Bayes, a minister and amateur mathe¬ 
matician. The area generated some interest by Laplace, Gauss, and others 
in the 19th century, but the Bayesian approach was ignored (or actively 
opposed) by the statisticians of the early 20th century. Fortunately, during 
this period several prominent non-statisticians, most notably Flarold Jef¬ 
freys (a physicist) and Arthur Bowley (an econometrician), continued to 
lobby on behalf of Bayesian ideas (which they referred to as "inverse prob¬ 
ability"). Then, beginning around 1950, statisticians such as L.J. Savage, 
Bruno de Finetti, Dennis Lindley, Jack Kiefer, and many others began 
advocating Bayesian methods as remedies for certain deficiencies in the 
classical approach. The following example illustrates such a deficiency in 
interval estimation. 

Example 1.1 Suppose Xi N(6, cr 2 ), i =1,.. n, where N denotes the 
normal (Gaussian) distribution and iid stands for "independent and iden¬ 
tically distributed." We desire a 95% interval estimate for the population 
mean 9. Provided n is sufficiently large (say, bigger than 30), a classical 
approach would use the confidence interval 

<5(x) = x ± 1.96 s/\/n , 

where x = (xq, . . . ,x n ), x is the sample mean, and .v is the sample standard 
deviation. 

This interval has the property that, on the average over repeated appli¬ 
cation, 6(x ) will fail to capture the true mean 9 only 5% of the time. An 
alternative interpretation is that, before any data are collected, the prob¬ 
ability that the interval contains the true value is 0.95. This property is 
attractive in the sense that it holds for all true values of# and <r 2 . 
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On the other hand, its use in any single data-analytic setting is some¬ 
what difficult to explain and understand. After collecting the data and 
computing 6(x) the interval either contains the true 9 or it does not; its 
coverage probability is not 0.95, but either 0 or 1. After observing x, a 
statement like, "the true 9 has a 95% chance of falling in <5(x)," is not 
valid, though most people (including most statisticians irrespective of their 
philosophical approach) interpret a confidence interval in this way. Thus 
for the frequentist, "95%" is not a conditional coverage probability, but 
rather a tag associated with the interval to indicate either how it is likely 
to perform before we evaluate it, or how it would perform over the long 
haul. A 99% frequentist interval would be wider, a 90% interval narrower, 
but, conditional on x, all would have coverage probability 0 or 1. ■ 

By contrast, Bayesian confidence intervals (known as "credible sets," and 
discussed further in Subsection 2.3.2) are free of this awkward frequentist 
interpretation. For example, conditional on the observed data, the proba¬ 
bility is 0.95 that 9 is in the 95% credible interval. Of course, this natural 
interpretation comes at the price of needing to specify a prior distribution 
for 9 so we must provide additional structure to obtain ease of interpreta¬ 
tion. 

The Neyman-Pearson testing structure can also lead to some very odd 
results, which have been even more heavily criticized by Bayesians. 

Example 1.2 Consider the following simple experiment, originally sug¬ 
gested by Lindley and Phillips (1976), and reprinted many times. Suppose 
in 12 independent tosses of a coin, I observe 9 heads and 3 tails. 1 wish 
to test the null hypothesis 7/ 0 : 0 = 1/2 versus the alternative hypoth¬ 
esis H a : 9 > 1/2, where 0 is the true probability of heads. Given only 
this much information, two choices for the sampling distribution emerge as 
candidates: 

1. Binomial: The number n = 12 tosses was fixed beforehand, and the 
random quantity X was the number of heads observed in the n tosses. 
Then X ~ Bin( 12, 9) and the likelihood function is given by 

hie) = f"V(l - 9) n ~ x = - 6) 3 . (1.2) 

2. Negative binomial: Data collection involved flipping the coin until the 
third tail appeared. Here, the random quantity X is the number of heads 
required to complete the experiment, so that X ~ NegBin{r = 3, 9), 
with likelihood function given by 

h(e)= ( r + * -1 ) fl ‘( 1 -*) r = (g 1 )* 9 *!-*) 3 - (1-3) 

Under either of these two alternatives, we can compute the p-value corre¬ 
sponding to the rejection region, "Reject H o if A' > c." Doing so using the 
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binomial likelihood (1.2) we obtain 

1 2 

«i = P 9=i (X >9) = £ ( 12 V '(1 - 0) 12 - J = .075 , 

while for the negative binomial likelihood (1.3), 

a 2 = P, = i(X>9) = f^ 2 + ^ 6i{\- 9f = .0325 . 

Thus, using the "usual" Type I error level a = .05, we see that the two 
model assumptions lead to two different decisions: we would reject. H o if X 
were assumed negative binomial, but not if it were assumed binomial. But 
there is no information given in the problem setting to help us make this 
determination, so it is not clear which analysis the frequentist should regard 
as "correct." In any case, assuming we trust the statistical model, it does 
not seem reasonable that how the experiment was monitored should have 
any bearing on our decision; surely only its results are relevant! Indeed, the 
likelihood functions tell a consistent story, since (1.2) and (1.3) differ only 
by a multiplicative constant that does not depend on 0. ■ 

A Bayesian explanation of what went wrong in the previous example 
would be that the Neyman-Pearson approach allows unobserved outcomes 
to affect the rejection decision. That is, the probability of X values "more 
extreme" than 9 (the value actually observed) was used as evidence against 
Hq in each case, even though these values did not occur. More formally, this 
is a violation of a statistical axiom known as the Likelihood Principle, a no¬ 
tion present in the work of Fisher and Barnard, but not precisely- defined 
until the landmark paper by Birnbaum (1962). In a nutshell, the Likeli¬ 
hood Principle states that once the data value x has been observed, the 
likelihood function L(9\x) contains all relevant experimental information 
delivered by x about the unknown parameter 6. In the previous exam¬ 
ple, L x and L 2 are proportional to each other as functions of 0, hence are 
equivalent in terms of experimental information (recall that multiplying a 
likelihood function by an arbitrary function h(x ) does not change the MLE 
B). Yet in the Neyman-Pearson formulation, these equivalent likelihoods 
lead to two different inferences regarding 6. Put another way, frequentist 
test results actually depend not only on what x was observed, but on how 
the experiment was stopped. 

Some statisticians attempt to defend the results of Example 1.2 by ar¬ 
guing that all aspects of the design of an experiment are relevant pieces 
of information even after the data have been collected, or perhaps that 
the Likelihood Principle is itself flawed (or at least should not be consid¬ 
ered sacrosanct). We do not delve deeper into this foundational discussion, 
but refer the interested reader to the excellent monograph by Berger and 
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Wolpert (1984) for a presentation of the consequences, criticisms, and de¬ 
fenses of the Likelihood Principle. Violation of the Likelihood Principle is 
but one of the possibly anomalous properties of classical testing methods; 
more are outlined later in Subsection 2.3.3, where it is also shown that 
Bayesian hypothesis testing methods overcome these difficulties. 

If Bayesian methods offer a solution to these and other drawbacks of the 
frequentist approach that were publicized several decades ago, it is per¬ 
haps surprising that Bayesian methodology did not make bigger inroads 
into actual statistical practice until only recently. There are several rea¬ 
sons for this. First, the initial staunch advocates of Bayesian methods were 
primarily subjectivists, in that they argued forcefully that all statistical 
calculations should be done only after one's own personal prior beliefs on 
the subject had been carefully evaluated and quantified. But this raised 
concerns on the part of classical statisticians (and some Bayesians) that 
the results obtained would not be objectively valid, and could be manip¬ 
ulated in any way the statististician saw fit. (The subjectivists' reply that 
frequentist methods were invalid anyway and should be discarded did lit¬ 
tle to assuage these concerns.) Second, and perhaps more important from 
an applied standpoint, the Bayesian alternative, while theoretically sim¬ 
ple, required evaluation of complex integrals in even fairly rudimentary 
problems. Without inexpensive, high-speed computing, this practical im¬ 
pediment combined with the theoretical concerns to limit the growth of 
realistic Bayesian data analysis. However, this growth finally did occur in 
the 1980s, thanks to a more objective group of Bayesians with access to 
inexpensive, fast computing. The objectivity issue is discussed further in 
Chapter 2, while computing is the subject of Chapter 5. 

These two main concerns regarding routine use of Bayesian analysis (i.e., 
that its use is not easy or automatic, and that it is not always clear how 
objective results may be obtained) were raised in the widely-read paper 
by Efron (1986). While the former issue has been largely resolved in the 
intervening years thanks to computing advances, the latter remains a chal¬ 
lenge and the subject of intense ongoing research (see Subsection 2.2.3 for 
an introduction to this area). Still, we contend that the advantages in us¬ 
ing Bayes and EB methods justify the increased effort in computation and 
prior determination. Many of these advantages are presented in detail in the 
popular textbook by Berger (1985, Section 4.1); we quickly review several 
of them here. 

1. Bayesian methods provide the user with the ability to formally incorpo¬ 
rate prior information. 

2. Inferences are conditional on the actual data. 

3. The reason for stopping the experimentation does not affect Bayesian 
inference (as shown in Example 1.2). 
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4. Bayesian answers are more easily interpretable by nonspecialists (as 
shown in Example 1.1). 

5. All Bayesian analyses follow directly from the posterior distribution; no 
separate theories of estimation, testing, multiple comparisons, etc. are 
needed. 

6. Any question can be directly answered through Bayesian analysis. 

7. Bayes and EB procedures automatically possess numerous optimality 
properties. 

As an example of the sixth point, to investigate the bioequivalence of two 
drugs, we would need to test H 0 : 9 X f 9 2 versus H a : 0\ - On. That is. 
we must reverse the traditional null and alternative roles, since the hy¬ 
pothesis we hope to reject is that the drugs are different, not that they 
are the same. This reversal turns out to make things quite awkward for 
traditional testing methods (see e.g., Berger and Hsu, 1996 for a careful 
review), but not for Bayesian testing methods, since they treat the null 
and alternative hypotheses equivalently; they really can "accept" the null, 
rather than merely "fail to reject" it. Finally, regarding the seventh point. 
Bayes procedures are typically consistent, can automatically impose parsi¬ 
mony in model choice, and can even define the class of optimal frequentist 
procedures, thus "beating the frequentist at his own game." We return to 
this final issue (frequentist motivations for using Bayes and EB procedures) 
in Chapter 4. 


1.5 Some basic Bayesian models 


The most basic Bayesian model has two stages, with a likelihood specifica¬ 
tion Y\6 ~ f(y\0) and a prior specification 0 ~ n(0), where either Y or 6 
can be vectors. In the simplest Bayesian analysis, 7 r is assumed known, so 
that by probability calculus, the posterior distribution of 0 is given by 

f(y\° Mg) n 

m{y) ’ (1 ‘ 4) 

where 

m(y) = J f(y\ 0 )tt( 0)d0 , (1.5) 

the marginal density of the data y. Equation (1.4) is a special case of 
Bayes' Theorem, the general form of which we present in Section 2.1. For 
all but rather special choices of/ and 7r (see Subsection 2.2.2), evaluating 
integrals such as (1.5) used to be difficult or impossible, forcing Bayesians 
into unappealing approximations. However, recent developments in Monte 
Carlo computing methods (see Chapter 5) allow accurate estimation of 
such integrals, and have thus enabled advanced Bayesian data analysis. 
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1.5.1 A Gaussian/Gaussian (nonnal/normal) model 


We now consider the case where both the prior and the likelihood are 
Gaussian (normal) distributions, * namely, 

9 ~ N(p,r 2 ) (1.6) 

Y\9 ~ N(9, a 2 ). 

The marginal distribution of Y given in (1.5) turns out to be N(p, a 2 + r 2 ), 
and the posterior distribution (1.4) is also Gaussian with mean and variance 

E(0\Y) = Bp + {1-B)Y (1.7) 

Var(6\Y) = (1 - B)a 2 (1.8) 

where B = a 2 /{a 2 + r 2 ). Since 0 < B < 1, the posterior mean is a weighted 
average of the prior mean p and the direct estimate Y ; the direct estimate is 
pulled back (or shrunk) toward the prior mean. Moreover, the weight on the 
prior mean B depends on the relative variability of the prior distribution 
and the likelihood. If a 2 is large relative to r 2 (i.e., our prior knowledge is 
more precise than the data information), then B is close to 1, producing 
substantial shrinkage. If a' 1 is small (i.e., our prior knowledge is imprecise 
relative to the data information), B is close to 0 and the direct estimate 
is moved very little toward the prior mean. As we show in Chapter 4, this 
shrinkage provides an effective tradeoff between variance and bias, with 
beneficial effects on the resulting mean squared error; see equation (1.1). 

If one is willing to assume that the structure (1.6) holds, but that the 
prior mean p and variance r 2 are unknown, then empirical Bayes or hyper¬ 
prior Bayes methods can be used; these methods are the subject of much 
of the remainder of this book. 


1.5.2 A beta/binomial model 

Next, consider applying the Bayesian approach given in (1.4)-( 1.5) to esti¬ 
mating a binomial success probability. With Y the number of events in n 
independent trials and 9 the event probability, the sampling distribution is 

P(Y = y\0) = f(y | 9) = Q^(l - 9) n ~y . 

In order to obtain a closed form for the marginal distribution, we use the 
Beta(a, b) prior distribution for 9 (see Section A.2 in Appendix A). For 
convenience we reparametrize from (a, b) to p M) where p =a/(a+b), the 
prior mean, and M =a + b, a measure of prior precision. More specifically, 
the prior variance is then p( 1 - p)/(M + 1), a decreasing function of M. 
As shown in detail in Subsection 3.3.2, the marginal distribution of Y is 

* Throughout the book, we use the terms "normal" and "Gaussian" interchangeably. 
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then beta-binomial with E (YIn) = jj, and 


V ar 



M 1 -ft) 

n 



n — 1 
M+ 1 


The term in square brackets is known variously as the "variance inflation 
factor," "design effect," or "component of extra-binomial variation." 

The posterior distribution of $ given Y in (1.4) is again Beta with mean 


e 


M n /Y\ 

M + M + n \ n / ’ 


fi + 


n 

M + n 



( 1 . 9 ) 


and variance Var(9 | Y) = [0(1 - 0)\/(M + n). Note that, similar to the 
posterior mean (1.7) in the Gaussian/Gaussian example, 0 is a weighted 
average of the prior mean p and the maximum likelihood estimate YIn, 
with weight depending on the relative size of M (the information in the 
prior) and n (the information in the data). 


Discussion 

Statistical decision rules can be generated by any philosophy under any col¬ 
lection of assumptions. They can then be evaluated by any criteria, even 
those arising from an utterly different philosophy. We contend (and the 
subsequent chapters will show) that the Bayesian approach is an excel¬ 
lent "procedure generator," even if one's evaluation criteria are frequentist 
provided that the prior distributions introduce only a small amount of in¬ 
formation. This agnostic view considers features of the prior (possibly the 
entire prior) as "tuning parameters" that can be used to produce a decision 
rule with broad validity. The Bayesian formalism will be even more effective 
if one desires to structure an analysis using either personal opinion or ob¬ 
jective information external to the current data set. The Bayesian approach 
also encourages documenting assumptions and quantifying uncertainty. Of 
course, no approach automatically produces broadly valid inferences, even 
in the context of the Bayesian models. A procedure generated by a high- 
information prior with most of its mass far from the truth will perform 
poorly under both Bayesian and frequentist evaluations. 

The experienced statistician prefers design and analysis methods that 
strike an effective tradeoff between efficiency and robustness. Striking a 
balance is of principal importance, irrespective of the underlying philoso¬ 
phy. For example, in estimation central focus should be on reduction of MSE 
and related performance measures through a tradeoff between variance and 
bias. This concept is appropriate for both frequentists and Bayesians. In 
this context our strategy is to use the Bayesian formalism to reduce MSE 
even when evaluations are frequentist. In a broad range of situations, one 
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can gain much of the Bayes or empirical Bayes advantage while retaining 
frequentist robustness. 

Importantly, the Bayesian formalism properly propagates uncertainty 
through the analysis enabling a more realistic (typically inflated) assess¬ 
ment of the variability in estimated quantities of interest. Also, the for¬ 
malism structures the analysis of complicated models where intuition may 
produce faulty or inefficient approaches. This structuring becomes espe¬ 
cially important in multiparameter models in which the Bayesian approach 
requires that the joint posterior distribution of all parameters structure all 
analyses. Appropriate marginal distributions focus on specific parameters, 
and this integration ensures that all uncertainties influence the spread and 
shape of the marginal posterior. 

1.6 Exercises 

1. Let 0 be the true proportion of men over the age of 40 in your community 
with hypertension. Consider the following "thought experiment" 

(a) Though you may have little or no expertise in this area, give an initial 
point estimate of 0 . 

(b) Now suppose a survey to estimate 0 is established in your community, 
and of the first 5 randomly selected men, 4 are hypertensive. How does 
this information affect, your initial estimate of 0 ? 

(c) Finally, suppose that at the survey's completion, 400 of 1000 men 
have emerged as hypertensive. Now what is your estimate of 6 1 

What guidelines for statistical inferential systems are suggested by your 
answers? 

2. Repeat the journal publication thought problem from Subsection 1.2.1 
for the situation where 

(a) you have won a lottery on your first try. 

(b) you have correctly predicted the winner of the first game of the World 

Series (professional baseball). 

3. Assume you have developed predictive distributions of the length of time 
it takes to drive to work, one distribution for Route A and one for Route 
B. What summaries of these distributions would you use to select a route 

(a) to maximize the probability that you make it to work in less than 30 
minutes? 

(b) to minimize your average commuting time? 

4. For predictive distributions of survival time associated with two medical 
treatments, propose treatment selection criteria that are meaningful to 
you (or if you prefer, to society). 
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5. Here is an example in a vein similar to that of Example 1.2, and orig¬ 
inally presented by Berger and Berry (1988). Consider a clinical trial 
established to study the effectiveness of Vitamin C in treating the com¬ 
mon cold. After grouping subjects into pairs based on baseline variables 
such as gender, age, and health status, we randomly assign one member 
of each pair to receive Vitamin C, with the other receiving a placebo. 
We then count how many pairs had Vitamin C giving superior relief 
after 48 hours. We wish to test Hq ^Vitamin C better) = i versus 
H a : P( Vitamin C better) ^ k- 

(a) Consider the expermental design wherein we sample n = 17 pairs, 
and observe x = 13 preferences for Vitamin C. What is the /7-value 
for the above two-sided test? 

(b) Now consider a two-stage design, wherein we first sample nj =17 
pairs, and observe x ( preferences for Vitamin C. In this design, if 
X\ > 13 or Xi < 4, we stop and reject H Q Otherwise, we sample an 
additional n 2 = 27 pairs, and subsequently reject H 0 if X 1 + X 2 >29 
or X(+X? < 15. (This second stage rejection region was chosen since 
under H 0 , P(X 1 + X 2 > 29 or .V, + X 2 < 15) = P(X l > 13 or X x < 
4), the /7-value in part (a) above.) 

If we once again observe Xj = 13, what is the /7-value under this new 
design? Is your answer consistent with that in part (a )? 

(c) What would be the impact on the p-value if we kept adding stages to 
the design, but kept observing xy = 13? 

(d) How would you analyze these data in the presence of a necessary but 
unforeseen change in the design - say, because the first 5 patients 
developed an allergic reaction to the treatment, and the trial was 
stopped by its clinicians. 

(e) What does all this suggest about the claim that /7-values constitute 
"objective evidence" against // 0 ? 

6. In the Normal/Normal example of Subsection 1.5.1, let cr 2 = 2 , ft = 0, 
and t =2. 

(a) Suppose we observe y = 4. What are the mean and variance of the 
resulting posterior distribution? Sketch the prior, likelihood, and pos¬ 
terior on a single set of coordinate axes. 

(b) Repeat part (a) assuming r 2 = 18. Explain any resulting differences. 
Which of these two priors would likely have more appeal for a fre- 
quentist statistician? 

7. In the basic diagnostic test setting, a disease is either present (D = 1) 
or absent ( D= 0), and the test indicates either disease (T = 1) or no 
disease (T = 0). Represent P{D = d\T = 1) in terms of test sensitivity. 
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PCI = 1 \I) = 1), specificity, P(T = 0\D = O), and disease prevalence, 
P(I) = |) , and relate to Bayes' theorem (1.4). 

8. In analyzing data from a Bin(n, 0) likelihood, the MLE is 8mle = Y In, 
which has MSE = E y ^(e M LE - 0? = Var y \ B {e M LE) = ®(1 - 0)/n. Find 
the MSE of the estimator 6Bayes = ( Y + 1)/(« + 2) and discuss in what 
contexts you would prefer it over Omle- (0Bayes is the estimator from 
equation (1.9) with p = 1/2 and M = 2.) 
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CHAPTER 2 


The Bayes approach 


2.1 Introduction 


We begin by reviewing the fundamentals introduced in Chapter 1. In the 
Bayesian approach, in addition to specifying the model for the observed 
data y = y x ,..., y n given a vector of unknown parameters 8 usually 
in the form of a probability distribution f(y\ 6), we suppose that 8 is a 
random quantity as well, having a prior distribution 7 r( 0 |r/), where rj is 
a vector of hypeiparameters. Inference concerning 8 is then based on its 
posterior distribution, given by 


p{Q\y,v) = 


p{ y>%) = p{ y, Q\ri) 

p(y|»?) J>(y,u|rj) du 

f(y\8)it(8\ri) 
//(y|u)7r(u|rj) du ' 


( 2 . 1 ) 


We refer to this formula as Bayes' Theorem. Notice the contribution of both 
the experimental data (in the form of the likelihood f) and prior opinion (in 
the form of the prior n to the posterior in the last expression of equation 
(2.1). The result of the integral in the denominator is sometimes written 
as ni(y\rf ), the marginal distribution of the data y given the value of the 
hyperparameter rj. If rj is known, we will often suppress it in the notation, 
since there is no need to express conditioning on a constant. For example, 
the posterior distribution would be written simply as p( 0ly). 


Example 2.1 Consider the normal (Gaussian) likelihood, where f(y |0) = 
^ exp{ - (y-») y (z sft () (z ■>}f' and a is a known positive constant. 

Henceforth we employ the shorthand notation f(y 1 6 = N(y\0 a 2 ) to de¬ 
note a normal density with mean 9 and variance a 2 . Suppose we take 
Tt(6\r]) = N(9\p, t 2 ) where p and r > 0 are known hyperparameters, 
so that rj = (p, t). By plugging these expressions into equation (2.1), it is 
fairly easy to show (Exercise 1 in Section 2.6) that the posterior distribution 
for 9 is given by 


p(9\y) = N 


a 2 p + r 2 y 
a 2 + r 2 


ct 2 t 2 \ 
a 2 + t 2 ) 


( 2 . 2 ) 
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Writing B — a f +r2 we see that this posterior distribution has mean B p + 
(1 - B)y and variance Bt 2 = (1-5) a 2 Since 0 <5 < 1, the posterior 
mean is a weighted average of the prior mean and the observed data value, 
with weights that are inversely proportional to the corresponding variances. 
For this reason, B is sometimes called a shrinkage factor, since it gives the 
proportion of the distance that the posterior mean is "shrunk back" from 
the ordinary frequentist estimate y toward the prior mean p. Note that 
when r 2 is large relative to cr 2 (i.e., vague prior information) B is small 
and the posterior mean is close to the data value y. On the other hand, 
when t 2 is small relative to a' 2 (i.e., a highly informative prior), B is large 
and the posterior mean is close to the prior mean p. 

Turning to the posterior variance, note that it is smaller than the vari¬ 
ance of both the prior and the likelihood. In particular, it is easy to show 
(Exercise 1 in Section 2.6) that the precision in the posterior is the sum of 
the precisions in the likelihood and the prior, where precision is defined to 
be the reciprocal of the variance. Thus the posterior distribution offers a 
sensible compromise between our prior opinion and the observed data, and 
the combined strength of the two sources of information leads to increased 
precision in our understanding of 0 m 


Given a sample of n independent observations we can obtain f(y\9 ) as 
nju/(Ki«) and proceed with equation (2.1). But evaluating this expres¬ 
sion may be simpler if we can find a statistic S(y) which is sufficient for 
8 (that is, for which f(y\9 = h(y)g(S(y)\ 6))- To see this, note that for 
S( y) = 5, 


P(%) 


/jyjgMg) 

J7(yl u Mu) du 
5(y)7/(.S’(y)i<9)7r(6>) 
/*(y)s'(5'(y)|u)7r(u) dll 


g(s\8)Tr{8) 

m(s) 


p(0|s) , 


provided m(s) > 0, since h(y) cancels in the numerator and denominator of 
the middle expression. So if we can find a sufficient statistic, we may work 
with it instead of the entire dataset y, thus reducing the dimensionality of 
the problem. 


Example 2.2 Consider again the normal/normal setting of Example 2. 1 , 
but where we now have an independent sample of size n from/(yI ff). Since 
S(y) =V i s sufficient for 0 we have that p(8\v) = \{6\y). But. since we 
know that f(y\0)= N(0,(j 2 /n), equation (2.2) implies that 


P(%) 


N [<r 2 /n)r 2 \ 

V I (<J 2 /n) + r 2 ' (<r 2 /n) + r 2 J 
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Figure 2.1 Prior and posterior distributions based on samples of size 1 and 10, 
normal/normal model. 


= N (g I ^ \ 

\ I <j 2 + nr 2 ’ cr 2 + nr 2 y 

As a concrete example, suppose that ^ = 2, r = 1, y = 6 , and cr = 1. 
Figure 2.1 plots the prior distribution, along with the posterior distribu¬ 
tions arising from two different sample sizes. When n = 1, the prior and 
likelihood receive equal weight, and hence the posterior mean is 4 = . 

When n =10, the data dominate the prior, resulting in a posterior mean 
much closer to y. Notice that the posterior variance also shrinks as n gets 
larger; the posterior collapses to a point mass on y as n tends to infinity. ■ 


If we are unsure as to the proper value for ry, the proper Bayesian solution 
would be to quantify this uncertainty in a second-stage prior distribution 
(sometimes called a hyperprior). Denoting this distribution by h( 77 ), the 
desired posterior for d is now obtained by also marginalizing over r/. 


p(0|y) 


p(y, g) _ fp(y,Q,v) d-g 

p( y) J f p(y,u,r]) dri du 
f f{yWM&\y)h{r 1 ) drj 

J f f(y\uMu\v)h(v) dr l du ’ 


(2.3) 
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Alternatively, we might simply replace 77 by an estimate 17 obtained as the 
value which maximizes the marginal distribution m(y\r}) viewed as a func¬ 
tion of 77 . Inference is now based on the estimated posterior distribution 
p(0\y , 77 ), obtained by plugging 77 into equation (2.1) above. Such an ap¬ 
proach is often referred to as empirical Bayes analysis, since we are using 
the data to estimate the prior parameter 77 . This is the subject of Chap¬ 
ter 3, where we also draw comparisons with the straight Bayesian approach 
using hyperpriors (c.f. Meeden, 1972; Deely and Lindley, 1981). 

Of course, in principle, there is no reason why the hyperprior for 77 can¬ 
not itself depend on a collection of unknown parameters A, resulting in a 
generalization of (2.3) featuring a second-stage prior h( 77 1A and a third- 
stage prior g( A). This enterprise of specifying a model over several levels 
is called hierarchical modeling, with each new distribution forming a new 
level in the hierarchy. The proper number of levels varies with the problem, 
though typically our knowledge at levels above the second prior stage is suf¬ 
ficiently vague that additional levels are of little benefit. Moreover, since we 
are continually adding randomness as we move down the hierarchy, subtle 
changes to levels near the top are not likely to have much of an impact 
on the one at the bottom (the data level), which is typically the only one 
for which we actually have observations. In order to concentrate on more 
foundational issues, in this chapter we will typically limit ourselves to the 
simplest model having only two levels (likelihood and prior). 

Continuing then with this two-level model (and again suppressing the 
dependence of the prior on the known value of 77 ), we observe that equation 
( 2 . 1 ) may be expressed in the convenient shorthand 

p{0\y) cx f(y\6)ir(6) , 

or in words, "the posterior is proportional to the likelihood times the prior." 
Clearly the likelihood may be multiplied by any constant (or any function 
of y alone) without altering the posterior. Bayes Theorem may also be used 
sequentially: suppose we have two independently collected samples of data, 
y! and y 2 ' Then 

p(%i,y 2 ) oc /(yi,y 2 |0M0) 

= /2(y2|0)/i(yi|0M0) 
oc / 2 (y 2 |0)p(0|yi) • (2-4) 

That is, we can obtain the posterior for the full dataset yi, y 2 by first 
finding /X#|yi) and then treating it as the prior for the second portion of 
the data y 2 . This easy algorithm for updating the posterior is quite natural 
when the data arrive sequentially over time, as in a clinical trial or perhaps 
a business or economic setting. 

Many authors (notably Geisser, 1993) have argued that concentrating 
on inference for the model parameters is misguided, since 0 is merely an 
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unobservable, theoretical quantity. Switching to a different model for the 
data may result in an entirely different 9 vector. Moreover, even a perfect 
understanding of the model does not constitute a direct attack on the 
problem of predicting how the system under study will behave in the future 
- often the real goal of a statistical analysis. To this end, suppose that y n+ j 
is a future observation, independent of y given the underlying 9. Then the 
predictive distribution for yn+1 is given by 

p(y n+ i |y) = j f(y n+ 1 \0)p(9\y)d9 , (2.5) 

thanks to the conditional independence of y n+ i and y. This distribution 
summarizes the information concerning the likely value of a new obser¬ 
vation, given the likelihood, the prior, and the data we have observed 
so far. (We remark that some authors refer to this distribution as the 
posterior predictive, and refer to the marginal distribution ra(2/n+i ) = 
f /(?/ n +i \9)ir(9)d9 as the prior predictive, since the latter summarizes our 
information concerning y n + 1 before having seen the data.) 

The Bayesian decision-making paradigm improves on the traditional, fre- 
quentist approach to statistical analysis in its more philosophically sound 
foundation, its unified and streamlined approach to data analysis, and its 
ability to formally incorporate the prior opinion of one or more experi¬ 
menters into the results via the prior distribution tt. Practicing statisti¬ 
cians and biostatisticians, once reluctant to adopt the Bayesian approach 
due to general skepticism concerning its philosophy and a lack of neces¬ 
sary computational tools, are now turning to it with increasing regularity 
as traditional analytic approaches emerge as both theoretically and prac¬ 
tically inadequate. For example, hierarchical Bayes methods form an ideal 
setting for combining information from several published studies of the 
same research area, an emerging scientific discipline commonly referred to 
as meta-analysis (DuMouchel and Harris, 1983; Cooper and Hedges, 1994). 
More generally, the hierarchical structure allows for honest assessment of 
heterogeneity both within and between groups such as laboratories or cen¬ 
sus areas. The review article by Breslow (1990) and the accompanying 
discussion provide an excellent summary of past and potential future ap¬ 
plication areas for Bayes and empirical Bayes methods in the public health 
and biomedical sciences. 

For statistical models of even moderate complexity, the integrals appear¬ 
ing in (2.1) and (2.3) are not tractable in closed form, and thus must be 
evaluated numerically. Fortunately, this problem has been largely solved 
in recent years thanks to the development of methodology designed to 
take advantage of modern computing power. In particular, sampling-based 
methods for estimating hyperparameters and computing posterior distri¬ 
butions play a central role in modern Bayes and empirical Bayes analysis. 
Chapter 5 contains an extensive discussion of the various computational 
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methods useful in Bayesian data analysis; for the time being, we assume 
that all relevant posterior distributions may be obtained, either analytically 
or numerically. 

In the remainder of this chapter, we discuss the steps necessary to con¬ 
duct a fully Bayesian data analysis, including prior specification, posterior 
and predictive inference, and diagnosis of departures from the assumed 
prior and likelihood form. Throughout our discussion, we illustrate with 
examples from the realms of biomedical science, public health, and envi¬ 
ronmental risk assessment as appropriate. 

2.2 Prior distributions 

Implementation of the Bayesian approach as indicated in the previous sub¬ 
section depends on a willingness to assign probability distributions not only 
to data variables like y, but also to parameters like 6. Such a requirement 
may or may not be consistent with the usual long-run frequency notion of 
probability. For example, if 

8 = true probability of success for a new surgical procedure, 

then it is possible (at least conceptually) to think of 6 as the limiting value 
of the observed success rate as the procedure is independently repeated 
again and again. But if 

0 = true proportion of U.S. men who are HIV-positive, 

the long-term frequency notion does not apply - it is not possible to even 
imagine "running the HIV epidemic over again" and reobserving 0 More¬ 
over, the randomness in 6 does not arise from any real-world mechanism; if 
an accurate census of all men and their HIV status were available, 8 could 
be computed exactly. Rather, here 0 is random only because it is unknown 
to us, though we may have some feelings about it (say, that 9 = .05 is 
more likely than 8 = .50). Bayesian analysis is predicated on such a belief 
in subjective probability, wherein we quantify whatever feelings (however 
vague) we may have about 8 before we look at the data y in a distribution 
7T. This distribution is then updated by the data via Bayes' Theorem (as 
in (2.1) or (2.3)) with the resulting posterior distribution reflecting a blend 
of the information in the data and the prior. 

Historically, a major impediment to widespread use of the Bayesian 
paradigm has been that determination of the appropriate form of the prior 
7 r (and perhaps the hyperprior h) is often an arduous task. Typically, these 
distributions are specified based on information accumulated from past 
studies or from the opinions of subject-area experts. In order to streamline 
the elicitation process, as well as simplify the subsequent computational 
burden, experimenters often limit this choice somewhat by restricting 7 r to 
some familiar distributional family. An even simpler alternative, available 
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in some cases, is to endow the prior distribution with little informative 
content, so that the data from the current study will be the dominant 
force in determining the posterior distribution. We address each of these 
approaches in turn. 

2.2.1 Elicited Priors 

Suppose for the moment that 9 is univariate. Perhaps the simplest approach 
to specifying n(9) is first to limit consideration to a manageable collection of 
9 values deemed "possible," and subsequently to assign probability masses 
to these values in such a way that their sum is 1, their relative contribu¬ 
tions reflecting the experimenter's prior beliefs as closely as possible. If $ 
is discrete-valued, such an approach may be quite natural, though perhaps 
quite time-consuming. If 0 is continuous, we must instead assign the masses 
to intervals on the real line, rather than to single points, resulting in a his¬ 
togram prior for $. Such a histogram (necessarily over a bounded region) 
may seem inappropriate, especially in concert with a continuous likelihood 
f(y\0) but may in fact be more appropriate if the integrals required to 
compute the posterior distribution must be done numerically, since com¬ 
puters are themselves discrete finite instruments. Moreover, a histogram 
prior may have as many bins as the patience of the elicitee and the accu¬ 
racy of his prior opinion will allow. It is vitally important, however, that 
the range of the histogram be sufficiently wide, since, as can be seen from 
(2.1), the support of the posterior will necessarily be a subset of that of 
the prior. That is, the data cannot lend credence to intervals for Q deemed 
"impossible" in the prior. 

Alternatively, we might simply assume that the prior for 9 belongs to 
a parametric distributional family n(9\ri) choosing r) so that the result 
matches the elicitee's true prior beliefs as nearly as possible. For exam¬ 
ple, if i] were two-dimensional, then specification of two moments (say, the 
mean and the variance) or two quantiles (say, the 50th and 95th) would 
be sufficient to determine its exact value. This approach limits the effort 
required of the elicitee, and also overcomes the finite support problem in¬ 
herent in the histogram approach. It may also lead to simplifications in the 
posterior computation, as we shall see in Subsection 2.2.2. 

A limitation of this approach is of course that it may not be possible for 
the elicitee to "shoehorn" his or her prior beliefs into any of the standard 
parametric forms. In addition, two distributions which look virtually iden¬ 
tical may in fact have quite different properties. For example, Berger (1985, 
p. 79) points out that the Cauchy(0,l) and Normal(0, 2.19) distributions 
have identical 25th, 50th, and 75th percentiles (-1, 0, and 1, respectively) 
and density functions that appear very similar when plotted, yet may lead 
to quite different posterior distributions. 

An opportunity to experiment with these two types of prior elicitation 
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in a simple setting is afforded below in Exercise 4. We now illustrate mul¬ 
tivariate elicitation in a challenging, real-life setting. 

Example 2.3 In attempting to model the probability pijki of an incorrect 
response by person j in cue group i on a working memory exam question 
having k "simple conditions" and "query complexity" level /, Carlin, Kass, 
Lerch, and Huguenard (1992) consider the model 


logit(pijki) 


Of + lk, 

/ = 0 

+ yk +a, 

1= 1 

(fp + fk + a + /?, 

1=2 


CO 

In this model, Oj denotes the effect for subject ij, y denotes the effect due 
to the number of simple conditions in the question, a measures the marginal 
increase in the logit error rate in moving from complexity level 0 to 1, and 
j3 does the same in moving from complexity level 1 to 2. Prior information 
concerning the parameters was based primarily on rough impressions, pre¬ 
cluding precise prior elicitation. As such, j, a, ;3) was defined as a 

product of independent, normally distributed components, namely, 


/ J 

n n "«f \m,v 2 e) X N(y\p, y , rf) x N 2 {a,/3\p a ,p l} ,<Tl,aj,p) (2.6) 

i = 1 j = 1 


where N represents the univariate normal distribution and N 2 represents 
the bivariate normal distribution. Hyperparameter mean parameters were 
chosen by eliciting "most likely" error rates, and subsequently converting 
to the logit scale. Next, variances were determined by considering the effect 
on the error rate scale of a postulated standard deviation (and resulting 
95% confidence set) on the logit scale. Note that the symmetry imposed by 
the normal priors on the logit scale does not carry over to the error rate 
(probability) scale. While neither normality nor prior independence was 
deemed totally realistic, prior (2.6) was able to provide a suitable starting 
place, subsequently updated by the information in the experimental data. 


Even when scientifically relevant prior information is available, elicitation 
of the precise forms for the prior distribution from the experimenters can 
be a long and tedious process. Some progress has been made in this area for 
normal linear models, most notably by Kadane et al. (1980). In the main, 
however, prior elicitation issues tend to be application- and elicitee-specific, 
meaning that general purpose algorithms are typically unavailable. Wolfson 
(1995) and Chaloner (1996) provide overviews of the various philosophies 
of elicitation (based on the ways that humans think about and update 
probabilistic statements), as well as reviews of the methods proposed for 
various models and distributional settings. 

The difficulty of prior elicitation has been ameliorated somewhat through 
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the addition of interactive computing, especially dynamic graphics, and 
object-oriented computer languages such as S (Becker, Chambers, and 
Wilks, 1988) or XLISP-STAT (Tierney, 1990). 

Example 2.4 In the arena of monitoring clinical trials, Chaloner et al. 
(1993) show how to combine histogram elicitation, matching a functional 
form, and interactive graphical methods. Following the advice of Kadane 
et al. (1980), these authors elicit a prior not on 0 (in this case, an unob¬ 
servable proportional hazards regression parameter), but on corresponding 
observable quantities familiar to their medically-oriented elicitees, namely 
the proportion of individuals failing within two years in a population of 
control patients, po and the corresponding two-year failure proportion in 
a population of treated patients, p v Writing the survivor function at time 
t for the controls as S(t), under the proportional hazards model we then 
have that p 0 = 1 - S( 2) and = 1 - S( 2 ) exp W Hence the equation 

log[ log( 1 - Pi)] = 0 + log[- log( 1 - Po )] (2.7) 

gives the relationship between po,Pi and 9. 

Since the effect of a treatment is typically thought of by clinicians as 
being relative to the baseline (control) failure probability they first 
elicit a best guess for this rate, p {) . Conditional on this modal value, they 
then elicit an entire distribution for p i, beginning with initial guesses for 
the upper and lower quartiles of p\'s distribution. These values determine 
initial guesses for the parameters g and a of a smooth density function that 
corresponds on the 0 scale to an extreme value distribution. This smooth 
functional form is then displayed on the computer screen and the elicitee 
is allowed to experiment with new values for g and a in order to obtain 
an even better fit to his or her true prior beliefs. Finally, fine tuning of the 
density is allowed via mouse input directly onto the screen. The density is 
restandardized after each such change, with updated quantiles computed 
for the elicitee's approval. At the conclusion of this process, the final prior 
distribution is discretized onto a suitably fine grid of points (pn ,..., Pik), 
and finally converted to a histogram-type prior on 0 via equation (2.7), for 
use in computing its posterior distribution. ■ 


2.2.2 Conjugate priors 

In choosing a prior belonging to a specific distributional family 7t(0|t/) some 
choices may be more computationally convenient than others. In particular, 
it may be possible to select a member of that family which is conjugate to 
the likelihood/(yI Q ), that is, one that leads to a posterior distribution 
p(Q\y) belonging to the same distributional family as the prior. Morris 
(1983b) showed that exponential families, from which we typically draw our 
likelihood functions, do in fact have conjugate priors, so that this approach 
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will often be available in practice. The computational advantages are best 
illustrated through an example. 


Example 2.5 Suppose that X is the number of pregnant women arriving 
at a particular hospital to deliver their babies during a given month. The 
discrete count nature of the data plus its natural interpretation as an arrival 
rate suggest adopting a Poisson likelihood. 


f(x\e) = 


X\ 


x e {o,i, 2 ,...}, e > o. 


To effect a Bayesian analysis we require a prior distribution for 0 having 
support on the positive real line. A reasonably flexible choice is provided 
by the gamma distribution, 


w(0) = 


ga-l e -9/p 

T(a)ft a 


9 > 0,o > 0, ft > 0, 


or 0 ~ G(a, ft) in distributional shorthand. Note that we have suppressed 
7 r's dependence on rj = (a, ,i )since we assume it to be known. The gamma 
distribution has mean aft variance ft 2 , and can have a shape that is either 
one-tailed (a < 1) or two-tailed (q > 1). (The scale of the distribution is 
affected by ft but its shape is not.) Using Bayes' Theorem (2.1) to obtain 
the posterior density, we have 


p(6\x) oc f(x\9)n(0) 
oc (e~V) 

_ QX+a— l g —0(l+l//3) _ (2.8) 


Notice that since our intended result is a normalized function of 9 we are 
able to drop any multiplicative functions that do not depend on 9. (For 
example, in the first line we have dropped the marginal distribution m(x) 
in the denominator, since it is free of 9 ) But now looking at (2.8), we see 
that it is proportional to a gamma distribution with parameters a' = x + a 
and ft 1 = (1 + ] / ft )“ 1 . Since this is the only function proportional to (2.8) 
that integrates to 1 and density functions uniquely determine distributions, 
we know that the posterior distribution for 0 is indeed G(a', ft'), and that 
the gamma is the conjugate family for the Poisson likelihood. ■ 

In this example, we might have guessed the identity of the conjugate 
prior by looking at the Poisson likelihood as a function of 9 instead of 
x as we typically do. Such a viewpoint reveals the 9 a e~ be structure that 
is also the basis of the gamma density. This technique is useful for deter¬ 
mining conjugate priors in many exponential families, as we shall see as 
we continue. Since conjugate priors can permit posterior distributions to 
emerge without numerical integration, we shall use them in many of our 
subsequent examples, in order to concentrate on foundational concerns and 
defer messy computational issues until Chapter 5. 
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For multiparameter models, independent conjugate priors may often be 
specified for each parameter, leading to corresponding conjugate forms for 
each conditional posterior distribution. 

Example 2.6 Consider once again the normal likelihood of Example 2.1, 
where now both 9 and a 2 are assumed unknown, so that 9 = (9, a 2 ). 
Looking at f as a function of 9 alone, we see an expression proportional 
to exp(-(0 - a) 2 /b), so that the conjugate prior is again given by the 
normal distribution. Thus we assume that wi(9) = N(9\p,r 2 ) as be¬ 
fore. Next, looking at f as a function of a, we see a form proportional 
to ( <j 2 )~ a e~ i l a ~ which is vaguely reminiscent of the gamma distribution. 
In fact, this is actually the reciprocal of a gamma: if X ~ G(a, 0), then 
Y= l/X ~ Inverse Gamma(a, j3) = IG(a,l3), with density function (see 
Appendix Section A.2) 

e -i//3y 

9(y\ a >0) = r(aj^ T yV+i ’ 2/ > 0, a > 0 , /? > 0. 

Thus we assume that w 2 (cr 2 ) = IG(a 2 \a, b ), where a and b are the shape 
and scale parameters, respectively. Finally, we assume that 9 and a 2 are a 
priori independent, so that n (9) = tt , (61) 7T2 (cr 2 ). With this componentwise 
conjugate prior specification, the two conditional posterior distributions 
emerge as expression (2.2) and 


p(a 2 \x, 9) = IG 







(2.9) 


We remark that in the normal setting, closed forms may also emerge for 
the marginal posterior densitiesp(0|;r) and p(a‘ 2 \x); we leave the details 
as an exercise. In any case, the ability of conjugate priors to produce at 
least unidimensional conditional posteriors in closed form enables them to 
retain their importance even in very high dimensional settings. This occurs 
through the use of Markov chain Monte Carlo integration techniques, which 
construct a sample from the joint posterior by successively sampling from 
the individual conditional posterior distributions. 

Finally, while a single conjugate prior may be inadequate to accurately 
reflect available prior knowledge, a finite mixture of conjugate priors may 
be sufficiently flexible (allowing multimodality, heavier tails, etc.) while 
still enabling simplified posterior calculations. For example, suppose we 
are given the likelihood f(x\9) for the univariate parameter 6. and adopt 
the two-component mixture prior 7t(0) = aTti(6) + (l-a)n 2 ( 6 ), 0 < a < 1, 
where both rr\ and n 2 belong to the family of priors conjugate for f. Then 

f(x\9)ni(6)a + f(x\$)ir 2 (6){l - a) 
f[f(x\9)Tr 1 (9)a + f(x\0)ir 2 (0)(l - a)]d6 
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/(ar|g)«-!(fl') 

mi(i) 


ini(x)a. + 


f(x\e)w 2 ( 0 ) 

rn^(x) 


m 2 (x)( 1 - a) 


m 1 (x)a + m 2 (x)( 1 - a) 

f(x\ff) n &) , /(* \0)* 2 (0) 

— -7 \—"t- 7 " '< w 2 

rni(x) m 2 (x) 

= Pi(0\x)w 1 + p 2 (d\x)w 2 , 


where uq = mi(x)a/{mi(x)a + m 2 (x)(l — a)] and w 2 = 1 — W\. Hence a 
mixture of conjugate priors leads to a mixture of conjugate posteriors. 


2.2.3 Noninformative Priors 

As alluded to earlier, often no reliable prior information concerning 0 ex¬ 
ists, or an inference based solely on the data is desired. At first, it might 
appear that Bayesian inference would be inappropriate in such settings, 
but this conclusion is a bit hasty. Suppose we could find a distribution 
7r(0) that contained "no information" about 8 in the sense that it did not 
favor one 6 value over another (provided both values were logically possi¬ 
ble). We might refer to such a distribution as a noninfonnative prior for 
0, and argue that all of the information resulting in the posterior p(0|x) 
arose from the data, and hence all resulting inferences were completely ob¬ 
jective, rather than subjective. Such an approach is likely to be important 
if Bayesian methods are to compete successfully in practice with their pop¬ 
ular likelihood-based counterparts (e.g., maximum likelihood estimation). 
But is such a "noninformative approach" possible? 

In some cases, the answer is an unambiguous "yes." For example, suppose 
the parameter space is discrete and finite, i.e., 0 — \0 l Then 

clearly the distribution 

p(0i) - 1/n, i=l,...,n, 

does not favor any one of the candidate 0 values over any other, and, as 
such, is noninformative for 6. If instead we have a bounded continuous 
parameter space, say 0 - [a, 6], -oo < a < b < oo, then the uniform 
distribution 

p(0) - 1/(6 - a), a < 9 < 6, 

is arguably noninformative for 9 (though this conclusion can be questioned, 
as we shall see later in this subsection and in the example of Subsec¬ 
tion 2.3.4). 

When we move to unbounded parameter spaces, the situation is even 
less clear. Suppose that 0 — (—oo,oo). Then the appropriate uniform 
prior would appear to be 

p(6) = c, any c > 0. 

But this distribution is improper, in that f p(0)d9 = oo. and hence does not 
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appear appropriate for use as a prior. But even here, Bayesian inference is 
still possible if the integral with respect to 9 of the likelihood /(x|0) equals 
some finite value K. (Note that this will not necessarily be the case: the 
definition of f as a joint probability density requires only the finiteness of 
its integral with respect to x.) Then 


P(0|x) 


/(*!#) -g 

f f(xj9) ■ c d9 


/(*!<?) 

K 


Since the integral of this function with respect to 9 is 1, it is indeed a proper 
posterior density, and hence Bayesian inference may proceed as usual. Still, 
it is worth reemphasizing that care must be taken when using improper 
priors, since proper posteriors will not always result. A good example (to 
which we return in Chapter 5 in the context of computation for high¬ 
dimensional models) is given by random effects models for longitudinal 
data, where the dimension of the parameter space increases with the sample 
size. In such models, the information in the data may be insufficient to 
identify all the parameters and, consequently, at least some of the prior 
distributions on the individual parameters must be informative. 

Noninformative priors are closely related to the notion of a reference 
prior, though precisely what is meant by this term depends on the author. 
Kass and Wasserman (1996) define such a prior as a conventional or "de¬ 
fault" choice - not necessarily noninformative, but a convenient place to 
begin the analysis. The definition of Bernardo (1979) stays closer to non- 
informativity, using an expected utility approach to obtain a prior which 
represents ignorance about 9 in some formal sense. Unfortunately, using 
this approach (and others that are similar), a unique representation for the 
reference prior is not always possible. Box and Tiao (1973, p. 23) suggest 
that all that is important is that the data dominate whatever information 
is contained in the prior, since as long as this happens, the precise form of 
the prior is unimportant. They use the analogy of a properly run jury trial, 
wherein the weight of evidence presented in court dominates whatever prior 
ideas or biases may be held by the members of the jury. 

Putting aside these subtle differences in definition for a moment, we 
ask the question of whether a uniform prior is always a good "reference." 
Sadly, the answer is no, since it is not invariant under reparametrization. 
As a simple example, suppose we claim ignorance concerning a univariate 
parameter 0 defined on the positive real line, and adopt the prior p( 6 
1. 9 > 0. A sensible reparametrization is given by y = g(9) = log($), since 
this converts the support of the parameter to the whole real line. Then the 
prior on 7 is given by p'( 7) = |Jjp(e 7 ), where ./ = dO/dy, the Jacobian of 
the inverse transformation. Here we have 9 — g~ l (y) = e 7 , so that./ = e y 
and hence 

p'(y) = e 7 , -00 < 7 < 00 , 


a prior that is clearly not uniform. Thus, using uniformity as a universal 
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definition of prior ignorance, it is possible that "ignorance about Q” does 
not imply "ignorance about 7 !” A possible remedy to this problem is to 
rely on the particular modeling context to provide the most reasonable 
parametrization and, subsequently, apply the uniform prior on this scale. 

The Jeffreys prior (Jeffreys, 1961, p. 181) offers a fairly easy-to-compute 
alternative that is invariant under transformation. In the univariate case, 
this prior is given by 

p{6) oc [7(0 )] 1/2 , ( 2 . 10 ) 

where 1(0 ) is the expected Fisher information in the model, namely, 

I(e) = -E x \ e J^-log/(x|<?) . (2.11) 

Notice that the form of the likelihood helps to determine the prior (2.10), 
but not the actual observed data since we are averaging over x in equation 
( 2 . 11 ). 

It is left as an exercise to show that the Jeffreys prior is invariant to 1-1 
transformation. That is, in the notation of the previous paragraph, it is 
true that 

m} 1/2 = [m 1/2 \^\, (2.i2) 

so that computing the Jeffreys prior for 7 directly produces the same answer 
as computing the Jeffreys prior for $ and subsequently performing the usual 
Jacobian transformation to the 7 scale. 

In the multiparameter case, the Jeffreys prior is given by 

p(0) oc |/(0)| 1/2 , (2.13) 

where | • | now denotes the determinant, and /(#) is the expected Fisher 
information matrix, having (/-element 

r«(«) = -£* 1 . [ 3 ^ log/(*|«) . 

While (2.13) provides a general recipe for obtaining noninformative pri¬ 
ors, it can be cumbersome to use in high dimensions. A more common 
approach is to obtain a noninformative prior for each parameter individu¬ 
ally, and then form the joint prior simply as the product of these individual 
priors. This action is often justified on the grounds that "ignorance" is con¬ 
sistent with "independence," although since noninformative priors are often 
improper, the formal notion of independence does not really apply. 

We close this subsection with two important special cases often en¬ 
countered in Bayesian modeling. Suppose the density of X is such that 
f(x\9) = f(x — 6). That is, the density involves 0 only through the term 
(x-0). Then 0 is called a location parameter, and/is referred to as a loca¬ 
tion parameter family . Examples of location parameters include the normal 
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mean and Cauchy median parameters. Berger (1985, pp.83-84) shows that, 
for a prior on 0 to be invariant under location transformations (i.e., trans¬ 
formations of the form Y = X + c), it must be uniform over the range of 
9. Hence the noninformative prior for a location parameter is given by 

p{0) = 1, 9 € ft . 

Next, assume instead that the density of X is of the form f(x\a ) = 
where a > 0. Then ir is referred to as a scale parameter, and f is 
a scale parameter family. Examples of scale parameters include the normal 
standard deviation (as our notation suggests), as well as the $ parameters 
in the Cauchy and t distributions and the /3 parameter in the gamma 
distribution. If we wish to specify a prior for a that is invariant under scale 
transformations (i.e., transformations of the form Y = cX. for c > 0), then 
Berger shows we need only consider priors of the form k/a, for k > 0. 
Hence the noninformative prior for a scale parameter is given by 

p(cr) = a > 0 . 
a 

If we prefer to work on the variance ( a 2 ) scale, we of course obtain p(cr 2 ) = 
l/cr 2 . Note that, like the uniform prior for the location parameter, both of 
these priors are also improper since J o °° da/a — f" J da 2 /a 2 = oo. 

Finally, many of the most popular distributional forms (including the 
normal, t, and Cauchy) combine both location and scale features into the 
form f(x\0,a) — the so-called location-scale family. The most 

common approach in this case is to use the two noninformative priors given 
above in concert with the notion of prior "independence," obtaining 

p{9, a ) = 0 £ a > 0 . 

a 

As discussed in the exercises, the Jeffreys prior for this situation takes a 
slightly different form. 

2.2.4 Other prior construction methods 

There are a multitude of other methods for constructing priors, both infor¬ 
mative and noninformative; see Berger (1985, Chapter 3) for an overview. 
The only one we shall mention is that of using the marginal distribution 
m(x) - f f(x\9)p(6)d0. Here, we choose the prior p(9) based on its ability 
to preserve consistency with the marginal distribution of the observed data. 
Unlike the previous approaches in this section, this approach uses not only 
the form of the likelihood, but also the actual observed data values to help 
determine the prior. We might refer to this method as empirical estimation 
of the prior, and this is, in fact, the method that is used in empirical Bayes 
(EB) analysis. 

Strictly speaking, empirical estimation of the prior is a violation of 
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Bayesian philosophy: the subsequent prior-to-posterior updating in equa¬ 
tion (2.1) would "use the data twice" (first in the prior, and again in 
the likelihood). The resulting inferences from this posterior would thus 
be "overconfident." Indeed, EB methods that ignore this fact are often 
referred to as naive EB methods. However, much recent work focuses on 
ways to correct these methods for the effect of their estimation of the prior 
- many with a good deal of success. This has enabled EB methods to re¬ 
tain their popularity among practitioners despite the onslaught of advanced 
computational methods designed to tackle the complex integration present 
in (2.3). These EB methods are the subject of Chapter 3. 

2.3 Bayesian inference 

Having specified the prior distribution using one of the techniques in the 
previous section, we may now use Bayes' Theorem (2.1) to obtain the pos¬ 
terior distribution of the model parameters, or perhaps equation (2.5) to 
obtain the predictive density of a future observation. Since these distri¬ 
butions summarize our current state of knowledge (arising from both the 
observed data and our prior opinion, if any), we might simply graph the 
corresponding density (or cumulative distribution) functions and report 
them as the basis for all posterior inference. However, these functions can 
be difficult to interpret and, in many cases, may tell us more than we want 
to know. Hence in this section, we discuss common approaches for summa¬ 
rizing such distributions. In particular, we develop Bayesian analogues for 
the common frequentist techniques of point estimation, interval estimation, 
and hypothesis testing. Throughout, we work with the posterior distribu¬ 
tion of the model parameters, though most of our methods apply equally 
well to the predictive distribution - after all, future data values and param¬ 
eters are merely different types of unknown quantities. We illustrate how 
Bayesian data analysis is typically practiced, and, in the cases of interval 
estimation and hypothesis testing, the philosophical advantages inherent 
in the Bayesian approach. 

2.3.1 Point estimation 

For ease of notation, consider first the univariate case. To obtain a point 
estimate 0 ( y) of 6 , we simply need to select a summary feature of p(0|y), 
such as its mean, median, or mode. The mode will typically be the easiest to 
compute, since no standardization of the posterior is then required: we may 
work directly with the numerator of (2.1). Also, note that when the prior 
7 r(0) is flat, the posterior mode will be equal to the maximum likelihood 
estimate of 0. For this reason, the posterior mode is sometimes referred to 
as the generalized maximum likelihood estimate of 0 . 

For symmetric posterior densities, the mean and the median will of course 
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Figure 2.2 Three point estimates arising from a Gamma(. 5, 1) posterior. 


be identical; for symmetric unimodal posteriors, all three measures will co¬ 
incide. For asymmetric posteriors, the choice is less clear, though the me¬ 
dian is often preferred since it is intermediate to the mode (which considers 
only the value corresponding to the maximum value of the density) and the 
mean (which often gives too much weight to extreme outliers). These dif¬ 
ferences are especially acute for one-tailed densities. Figure 2.2 provides an 
illustration using a G(.5, 1) distribution. Here, the modal value (0) seems 
a particularly poor choice of summary statistic, since it is far from the 
"middle" of the density; in fact, the density is not even finite at this value. 
In order to obtain a measure of the accuracy of a point estimate 9(y) of 
we might use the posterior variance with respect to #(y), 

Eb\ y(0 - H y)) 2 • 

Writing the posterior mean E»\y(9) = p( y) simply as p for the moment, 
we have 

E e \y(0-Hy)) 2 = E e[y (9-p + p-9(y)) 2 

= E«\y[{9 - pf + 2(6» — p)(p - 9( y)) + {p - ^(y)) 2 ] 

= Var e \ y (9) + 2 [p - 0(y)][£ , 9 |y(6') -p] + (p- 9( y)) 2 
= Var e]y {9) + (p- 9( y)) 2 , 
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since the middle term on the third line is identically zero. Hence we have 
shown that the posterior mean fi minimizes the posterior variance with 
respect to 6( y) over all point estimators 9{y). Furthermore, this minimum 
value is Varg\y(9) = Eg |y[$ — Eg |y(#)] 2 the posterior variance of 9 (usually 
referred to simply as the posterior variance). This is a possible argument 
for preferring the posterior mean as a point estimate, and also partially ex¬ 
plains why historically only the posterior mean and variance were reported 
as the results of a Bayesian analysis. 

Turning briefly to the multivariate case, we might again take the pos¬ 
terior mode as our point estimate 0 = (#1 ) though it will now be 

numerically harder to find. When the mode exists, a traditional maximiza¬ 
tion method such as a grid search, golden section search, or Newton-type 
method will often succeed in locating it (see e.g., Thisted, 1988, Chapter 4 
for a description of these and other maximization methods). The posterior 
mean // = Eg^y(9) is also a possibility, since it is still well-defined and its 
accuracy is captured nicely by the posterior covariance matrix. 


V=E, 


0\y 


(0- n){0- n) 


Similar to the univariate case, one can show that 


E e\y 


(0-0(y))(0-0(y))' 


v + (n - 9(y))(n - 6{y))' , 


so that the posterior mean "minimizes" the posterior covariance matrix 
with respect to 9{y). The posterior median is more difficult to define 
in multiple dimensions, though it is still often used in one-dimensional 
subspaces of such problems, i.e., as a summary statistic for »(f?i|y) = 
/ •••/?(%) d0 2 ,...,d9 k . 


Example 2.7 Perhaps the single most important contribution of statis¬ 
tics to the field of scientific inquiry is the general linear model, of which 
regression and analysis of variance models are extremely widely-used spe¬ 
cial cases. A Bayesian analysis of this model was first presented in the 
landmark paper by Lindley and Smith (1972), which we summarize here. 
Suppose that Y|0i ~ N(Ai0 x , C\) where Y is an n x 1 data vector, 0, is 
a/jjxl parameter vector, A\ is an n X pi known design matrix, and C\ is 
an n x n known covariance matrix. Suppose further that we adopt the prior 
distribution 9\ ~ N(A 2 8 2 , C 2 ). where 0 2 is ap 2 x 1 parameter vector, A 2 
is a pi x p 2 design matrix, C 2 is a pi x pi covariance matrix, and 0 2 , A 2 , 
and C 2 are all known. Then the marginal distribution of Y is 

Y ~N(A l A 2 02, Ci + AiC^i), 

and the posterior distribution of 0j is 

0i|y ~ N(Dd , D) , (2.14) 
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where 


D- 1 =A' 1 C^ l A i + C^ 1 , 


(2.15) 


and 

d = A' 1 C^ 1 y + C^A 2 0 2 . (2.16) 

Thus E(9 i|y) = Dd provides a point estimate for 6\ with associated 
variability captured by the posterior covariance matrix Var(<?i y) = D. 

As a more concrete illustration, we consider the case of linear regression. 
Using the usual notation, we would set Ai = X, 6\ = (3, and C\ — crl n . A 
noninformative prior is provided by taking Cj 1 = 0, i.e., setting the prior 
precision matrix equal to a pj x p\ matrix of zeroes. Then from equations 
(2.15) and (2.16), we have 

D~ l = X\<r 2 I n )- l X + 0 = ^{X'X) , 

and 

d = X’(a 2 I n )- 1 y + 0=±(X'y), 

(7 

so that the posterior mean is given by 
1 1 1 1 

Dd= —(X'X) -(X>y) = (X>X)- 1 X'y=(3 LS , 

<r z J <y i 

the usual least squares estimate of j3. From (2.14), the posterior distribution 
of (3 is 

fty-Nip^v^X'X)- 1 ). 

Recall that the sampling distribution of the least squares estimate is given 
by (3 ls |/3 ~ N ( ft. cr 2 (X'X)~ l ) so that classical and noninformative Bayes¬ 
ian inferences regarding f3 will be formally identical in this example. 

If C\ = Var(Y\0 ]) is unknown, a closed form analytic solution for the 
posterior mean of f3 will typically be unavailable. Fortunately, sampling- 
based computational methods are available to escape this unpleasant situ¬ 
ation, as we shall illustrate later in Example 5.6. ■ 

2.3.2 Intenml estimation 

The Bayesian analogue of a frequentist confidence interval (Cl) is usually 
referred to as a credible set, though we will often use the term "Bayesian 
confidence interval" or simply "confidence interval" in univariate cases for 
consistency and clarity. More formally. 

Definition 2.1 A 100 x (1 - a) % credible set for 6 is a subset C of 0 
such that 

1 - a < P(C|y) = f p(d\y)d6 , 

Jc 

where integration is replaced by summation for discrete components of 6. 
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This definition enables direct probability statements about the likelihood 
of 9 falling in C, i.e., 

"The probability that 0 lies in C given the observed data y is at least (1 — a).” 

This is in stark contrast to the usual frequentist Cl, for which the corre¬ 
sponding statement would be something like, 

"If we could recompute C for a large number of datasets collected in the same 

way as ours, about (1 - a) x 100% of them would contain the true value of 6.” 

This is not a very comforting statement, since we may not be able to even 
imagine repeating our experiment a large number of times (e.g., consider an 
interval estimate for the 1993 U.S. unemployment rate). In any event, we 
are in physical possession of only one dataset; our computed C will either 
contain 0 or it won't, so the actual coverage probability will be either 1 or 
0. Thus for the frequentist, the confidence level (1- a is only a "tag" that 
indicates the quality of the procedure (i.e., a narrow 95% Cl should make 
us feel better than an equally narrow 90% one). But for the Bayesian, the 
credible set provides an actual probability statement, based only on the 
observed data and whatever prior opinion we have added. 

We used “<” instead of "=" in Definition 2.1 in order to accommodate 
discrete settings, where obtaining an interval with coverage probability ex¬ 
actly (1 - a) may not be possible. Indeed, in continuous settings we would 
like credible sets that do have exactly the right coverage, in order to mini¬ 
mize their size and thus obtain a more precise estimate. More generally, a 
technique for doing this is given by the highest posterior density, or HPD, 
credible set, defined as the set 

C = {9 e & : p(B\y) > k(a)} , 

where k(a) is the largest constant satisfying 

P(C|y)>l—a. 

A good way to visualize such a set C is provided by Figure 2.3, which 
illustrates obtaining an HPD credible interval for $ assumed to follow a 
G(2, 1) posterior distribution. Here, drawing a horizontal line across the 
graph at p( 9 \y) =0.1 results in a 87% HPD interval of (0.12, 3.59); see 
the solid line segments in the figure. That is, P(0.12 < 6 < 3.59ly) = .87 
when 0|y ~ G( 2, 1). We could obtain a 90% or 95% HPD interval simply 
by "pushing down" the horizontal line until the corresponding values on 
the 0-axis trapped the appropriate, larger probability. 

Such a credible set is very intuitively appealing since it groups together 
the "most likely" 9 values and is always computable from the posterior 
density, unlike classical CIs which frequently rely on "tricks" such as pivotal 
quantities. However, the graphical "line pushing" mentioned in the previous 
paragraph translates into iteratively solving the nonlinear equation 

P(C(fc(«))|y) = 1 - « , 
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Figure 2.3 HPD and equal-tail credible intervals arising from a Gamma(2,l } pos¬ 
terior using k(a) = 0.1. 


in order to find C for a prespecified a level. This calculation is such a 
nuisance that we often instead simply take the a/2- and (1 — a/2)-quantiles 
of p(0 jy) as our 100 x (1 - a)% credible set for 0. This equal tail credible 
set will be equal to the HPD credible set if the posterior is symmetric and 
unimodal, but will be a bit wider otherwise. 

In the context of our Figure 2.3 example, we obtain an 87% equal tail 
credible set of (0.42, 4.39); see the dashed line segments in the figure. By 
construction, we have 

P(0.42 < 0 < 4.39|y) = 1 - P(6 < 0.42|y) - P(6 > 4.39|y) 

= 1- a/2 -a/2 

= 1 — a = 0.87. 

Due to the asymmetry of our assumed posterior distribution, this interval 
is 0.5 units wider than the corresponding HPD interval, and includes some 
larger 0 values that are "less likely" while omitting some smaller 0 values 
that are "more likely". Still, this interval is trivially computed for any a 
using one of several statistical software packages and seems to satisfactorily 
capture the "middle" of the posterior distribution. Moreover, this interval 
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is invariant to 1 -1 transformations of the parameter; for example, the logs of 
the interval endpoints above provide an 87% equal tail interval for r) = log 6. 

For a typical Bayesian data analysis, we might summarize our findings by 
reporting (a) the posterior mean and mode, (b) several important posterior 
percentiles (corresponding to, say, probability levels .025, .25, .50, .75, and 
.975), (c) a plot of the posterior itself if it is multimodal, highly skewed, 
or otherwise badly behaved, and possibly (d) posterior probabilities of the 
form P(0 > c|y) or P(f < c|y), where c is some important reference point 
(e.g., 0) that arises from the context of the problem. We defer an illustration 
of these principles until after introducing the basic concepts of Bayesian 
hypothesis testing. 


2.3.3 Hypothesis testing and Bayes factors 

The comparison of predictions made by alternative scientific explanations 
is, of course, a mainstay of current statistical practice. The traditional 
approach is to use the- ideas of Fisher, Neyman, and Pearson, wherein one 
states a primary, or null, hypothesis Ho , and an alternative hypothesis H a . 
After determining an appropriate test statistic T(Y), one then computes 
the obsen'ed significance, or p-value, of the test as 

/7-value = P{T(Y) more "extreme" than T{y 0 t, s )\6 . H 0 } , (2.17) 

where "extremeness" is in the direction of the alternative hypothesis. If 
the />-value is less than some prespecified Type 1 error rate. Ho is rejected; 
otherwise, it is not. 

While classical hypothesis testing has a long and celebrated history in 
the statistical literature and continues to be the overwhelming favorite of 
practitioners, several fairly substantial criticisms of it may be made. First, 
the approach can be applied straightforwardly only when the two hypothe¬ 
ses in question are nested, one within the other. That is,Ho must normally 
be a simplification of H a (say, by setting one of the model parameters in 
H a equal to a constant - usually zero). But many practical hypothesis test¬ 
ing problems involve a choice between two (or more) models that are not 
nested (e.g., choosing between quadratic and exponential growth models, 
or between exponential and lognormal error distributions). 

A second difficulty is that tests of this type can only offer evidence against 
the null hypothesis. A small /7-value indicates that the larger, alternative 
model has significantly more explanatory power. However, a large /7-value 
does not suggest that the two models are equivalent, but only that we lack 
evidence that they are not. This difficulty is often swept under the rug in 
introductory statistics courses, the technically correct phrase "fail to reject 
[the null hypothesis]" being replaced by "accept." 

Third, the /7-value itself offers no direct interpretation as a "weight of 
evidence," but only as a long-term probability (in a hypothetical repetition 
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of the same experiment) of obtaining data at least as unusual as what 
was actually observed. Unfortunately, the fact that small p-values imply 
rejection of Ho causes many consumers of statistical analyses to assume 
that the /r-value is "the probability that Ho is true," even though (2.17) 
shows that it is nothing of the sort. 

A final, somewhat more philosophical criticism of /r-values is that they 
depend not only on the observed data, but also the total sampling proba¬ 
bility of certain unobserved datapoints; namely, the "more extreme" T(Y) 
values in equation (2.17). As a result, two experiments with identical likeli¬ 
hoods could result in different p-values if the two experiments were designed 
differently. As mentioned previously in conjunction with Example 1.2, this 
fact violates a proposition known as the Likelihood Principle (Birnbaum, 
1962), which can be stated briefly as follows: 

The Likelihood Principle: In making inferences or decisions about 9 after y 

is observed, all relevant experimental information is contained in the likelihood 

function for the observed y. 

By taking into account not only the observed data y, but also the unob¬ 
served but more extreme values of Y classical hypothesis testing violates 
the Likelihood Principle. 

This final criticism, while theoretically based, has practical import as 
well. For instance, it is not uncommon for unforeseen circumstances to 
arise during the course of an experiment, resulting in data that are not ex¬ 
actly what was intended in the experimental design. Technically speaking, 
in this case a frequentist analysis of the data is not possible, since p-values 
are only computable for data arising from the original design. Problems 
like this are especially common in the analysis of clinical trials data, where 
interim analyses and unexpected drug toxicities can wreak havoc with the 
trial's design. Berger and Berry (1988) provide a lucid and nontechnical dis¬ 
cussion of these issues, including a clever example showing that a classically 
significant trial finding can be negated merely by contemplating additional 
trial stages that did not occur. 

By contrast, the Bayesian approach to hypothesis testing, due primar¬ 
ily to Jeffreys (1961), is much simpler and more sensible in principle, and 
avoids all four of the aforementioned difficulties with the traditional ap¬ 
proach. As in the case of interval estimation, however, it requires the notion 
of subjective probability to facilitate its probability calculus. Put succinctly, 
based on the data that each of the hypotheses is supposed to predict, one 
applies Bayes' Theorem and computes the posterior probability that the 
first hypothesis is correct. There is no limit on the number of hypothe¬ 
ses that may be simultaneously considered, nor do any need to be nested 
within any of the others. As such, in what follows, we switch notation from 
"hypotheses" Ho and H a to "models" Mi, i = 1,..., m. 

To lay out the mechanics more specifically, suppose we have two can- 
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didate parametric models M\ and M 2 for data Y and the two models 
have respective parameter vectors 0\ and 9 2 . Under prior densities 7r,(0,), 
i = l,2, the marginal distributions of Y are found by integrating out the 
parameters. 


p(y\Mi) = j /(yl^M)^)^ , i = 1,2 . (2.18) 

Bayes' Theorem (2.1) may then be applied to obtain the posterior prob¬ 
abilities P(Mi|y) and P(M 2 |y) — l- P(Mi|y) for the two models. The 
quantity commonly used to summarize these results is the Bayes factor, 
BF, which is the ratio of the posterior odds of M\ to the prior odds of Mi, 
given by Bayes' Theorem as 


BF 


P(Mi\y)/P(M 2 \y) 

P(Mi)/P(M 2 ) 

' pjylM^PjM^ ] , \ P (y\M 2 )P(M 2 ) 

p( y) 1 p( y) 

P(Mi)/P(M 2 ) 

p {y I Mi) 

p (y \m 2 ) ’ 


(2.19) 


( 2 . 20 ) 


the ratio of the observed marginal densities for the two models. Assuming 
the two models are a priori equally probable (i.e.,P(Mi) = P(M 2 ) = 0.5), 
we have that BF = P(Mj|y)/P(M 2 |y), the posterior odds of Mi- 

Consider the case where both models share the same parametrization 
(i.e. 6i=9 2 = 9), and both hypotheses are simple (i.e., M\ : 6 = 9^ 1 * and 
M 2 : 9 = 9^). Then 7r,-(0) consists of a point mass at 9 ^ for i = 1, 2, and 
so from (2.18) and (2.20) we have 


BF = 


/(ylfl (1) ) 

/(y|0 (2) ) 


which is nothing but the likelihood ratio between the two models. Hence, 
in the simple-versus-simple setting, the Bayes factor is precisely the odds in 
favor of Mi over M 2 given solely by the data. For more general hypotheses, 
this same "evidence given by the data" interpretation of BF is often used, 
though Lavine and Schervish (1999) show that a more accurate interpreta¬ 
tion is that BF captures the change in the odds in favor of model 1 as we 
move from prior to posterior. In any event, in such cases, BF does depend 
on the prior densities for the 6i , which must be specified either as conve¬ 
nient conjugate forms or by more careful elicitation methods. In this case, 
it becomes natural to ask whether "shortcut" methods exist that provide a 
rough measure of the evidence in favor of one model over another without 
reference to any prior distributions. 

One answer to this question is given by Schwarz (1978), who showed that 
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( 2 . 21 ) 


for large sample sizes n, an approximation to -2 log BF is given by 


ABIC = W- (p 2 -pi)logn , 


where pi is the number of parameters in model Mi ,i = 1,2, and 


W = -2 log 


' sup Ml f(y\o) ' 

. su pm 2 /(y|0). 


the usual likelihood ratio test statistic. BIC stands for Bayesian Informa¬ 
tion Criterion (though it is also known as the Schwarz Criterion), and A 
denotes the change from Model 1 to Model 2. The second term in ABIC 
acts as a penalty term which corrects for differences in size between the 
models (to see this, think of M 2 as the "full" model and M\ as the "re¬ 
duced" model). Thus exp(— 2 A BIC) provides a rough approximation to 
the Bayes factor which is independent of the priors on the 

There are many other penalized likelihood ratio model choice criteria, 
the most famous of which is the A kaike Information Criterion, 


AAIC = W — 2(p 2 -Pi) ■ (2.22) 


This criterion is derived from frequentist asymptotic efficiency considera¬ 
tions (where both n and the pi approach infinity). This statistic is fairly 
popular in practice, but approaches -2 log BF asymptotically only if the 
information in the prior increases at the same rate, as the information in 
the likelihood, an unrealistic assumption in practice. Practitioners have also 
noticed that A/C tends to "keep too many terms in the model," as would 
be expected from its smaller penalty term. 

While Bayes factors were viewed for many years as the only correct way 
to carry out Bayesian model comparison, they have come under increasing 
criticism of late. The most serious difficulty in using them is quite easy to 
see: if 7 r,-( 0 ,-) is improper (as is typically the case when using noninformative 
priors), then p(y|M;) = f /(y|0,, M,-)7r,-(0,)d0,- necessarily is as well, and 
so BF as given in (2.20) is not well-defined. Several authors have attempted 
to modify the definition of BF to repair this deficiency. One approach is 
to divide the data into two parts, y = (yi,y 2 ) and use the first portion 
to compute model-specific posterior densities p(0j|yi), i = 1, 2. Using these 
two posteriors as the priors in equation (2.20), the remaining data y 2 then 
produces the partial Bayes factor, 


BF( y 2 |yi) 


p(y2|yi,Mi) 

p(y2|yi, m 2 ) 


While this solves the nonexistence problem in the case of improper priors, 
there remains the issue of how many datapoints to allocate to the "training 
sample" yj and how to select them. Berger and Pericchi (1996) propose 
using an arithmetic or geometric mean of the partial Bayes factors obtained 
using all possible minimal training samples (i.e., where n 1 is taken to be 
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the smallest sample size leading to proper posteriors p{8 l |yij). Since the 
result corresponds to an actual Bayes factor under a reasonable intrinsic 
prior, they refer to it as an intrinsic Bayes factor. O'Hagan (1995) instead 
recommends a fractional Bayes factor, defined as 


BF b (y) = 


Ky> fe |Mi) 

p(y,b\M 2 ) 


where b = iii/n. ni is again a "minimal" sample size, and 


p{y,b\Mi) 


f f(y\6i,Mi)TTi(8i)d8i 


O'Hagan (1995, 1997) also suggests other, somewhat larger values for b , 
such as b — max(ni,logn)/n and b = max(ni, y/nj/n, on the grounds that 
this should reduce sensitivity to the prior. 

Note that while the intrinsic Bayes factor uses part of the data to rectify 
the impropriety in the priors, the fractional Bayes factor instead uses a 
fraction b of the likelihood function for this purpose. O'Hagan (1997) offers 
asymptotic justifications for B F b (y) as well as comparison with various 
forms of the intrinsic BF; see also DeSantis and Spezzaferri (1999) for 
further comparisons, finite-sample motivations for using the fractional BF, 
and open research problems in this area. 


2.3.4 Example: Consumer preference data 

We summarize our presentation on Bayesian inference with a simple ex¬ 
ample, designed to illustrate key concepts in the absence of any serious 
computational challenges. Suppose sixteen consumers have been recruited 
by a fast food chain to compare two types of ground beef patty on the basis 
of flavor. All of the patties to be evaluated have been kept frozen for eight 
months, but one set of 16 has been stored in a high-quality freezer that 
maintains a temperature that is consistently within one degree of 0 degrees 
Fahrenheit. The other set of 16 patties has been stored in a freezer wherein 
the temperature varies anywhere between 0 and 15 degrees Fahrenheit. 
The food chain executives are interested in whether storage in the higher- 
quality freezer translates into a substantial improvement in taste (as judged 
by the consumers), thus justifying the extra effort and cost associated with 
equipping all of their stores with these freezers. 

In a test kitchen, the patties are defrosted and prepared by a single 
chef. To guard against any possible order bias (e.g., it might be that a 
consumer will always prefer the first patty he or she tastes), each consumer 
is served the patties in a random order, as determined by the chef (who has 
been equipped with a fair coin for this purpose). Also, the study is double- 
blind: neither the consumers nor the servers know which patty is which; 
only the chef (who is not present at serving time) knows and records this 
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Figure 2.4 Prior distributions used in analyzing the consumer preference data. 


information. This is also a measure to guard against possible bias on the 
part of the consumers or the servers (e.g., it might be that the server talks 
or acts differently when serving the cheaper patty). After allowing sufficient 
time for the consumers to make up their minds, the result is that 13 of the 
16 consumers state a preference for the more expensive patty. 

In order to analyze this data, we need a statistical model. In Bayesian 
analysis, this consists of two things: a likelihood and a prior distribution. 
Suppose we let 0 be the probability that a consumer prefers the more 
expensive patty. Let Y, = I if consumer i states a preference for the more 
expensive patty, and V) = 0 otherwise. If we assume that the consumers 
are independent and that 0 is indeed constant over the consumers, then 
their decisions form a sequence of Bernoulli trials. Defining X = Y t , 

we have 


X\0 ~ Binomial(l6, 0) . 

That is, the sampling density for x is f(x\0) = ( 6 X (1 - 0) l6 ~ x . 

Turning to the selection of a prior distribution, looking at the likelihood 
as a function of 0, we see that the beta distribution offers a conjugate 
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family, since its density function is given by 


tt(6) = 


r(a + /?) a _ l 

T(a)m 


(i-ef - 1 . 


Before adopting this family, however, we must ask ourselves if it is suffi¬ 
ciently broad to include our true prior feelings in this problem. Fortunately 
for us, the answer in this case seems to be yes, as evidenced by Figure 2.4. 
This figure shows three different members of the beta family, all of which 
might be appropriate. The solid line corresponds to the Beta(. 5, .5) distri¬ 
bution, which is the Jeffreys prior for this problem (see Section 2.6). We 
know this prior is noninformative in a transformation-invariate sense, but 
it seems to provide extra weight to extreme values of 9. The dotted line 
displays what we might more naturally think of as a noninformative prior, 
namely, a flat (uniform) prior, obtained by setting a — /? = 1 - Finally, 
the dashed line follows a Beta(2, 2) prior, which we might interpret as re¬ 
flecting mild skepticism of an improvement in taste by the more expensive 
patty (i.e., it favors values of 6 near .5). The beta family could also provide 
priors that are not symmetric about .5 (by taking a ^ /?), if the fast food 
chain executives had strong feelings or the results of previous studies to 
support such a conclusion. However, we shall restrict our attention to the 
three choices in Figure 2.4, since they seem to cover an adequate range of 
"minimally informative" priors which would not be controversial regard¬ 
less of who is the ultimate decision-maker (e.g., the corporation's board of 
directors). 

Thanks to the conjugacy of our prior family with the likelihood, the 
posterior distribution for 9 is easily calculated as 


p(6\x) oc f(x\6)Tr(9) 

(X r+“—^l-f ?) 16 -^— 1 

oc Beta(x + a, 16 — x + /?) . 

The three beta posteriors corresponding to our three candidate priors are 
plotted in Figure 2.5, with corresponding posterior quantile information 
summarized in Table 2.1. The three posteriors are ordered in the way we 
would expect, i.e., the evidence that 9 is large decreases as the prior becomes 
more "skeptical" (concentrated near 1/2). Still, the differences among them 
are fairly minor, suggesting that despite the small sample size (n =16), 
the results are robust to subtle changes in the prior. The posterior medians 
(which we might use as a point estimate of 6 are all near 0.8; posterior 
modes (which would not account for the left skew in the distributions) 
would be even larger. In particular, the posterior mode under the flat prior 
is 13/16 = 0.8125, the maximum likelihood estimate of 9. Also note that 
all three posteriors produce 95% equal-tail credible intervals that exclude 
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Figure 2.5 Posterior distributions arising from the three prior distributions cho¬ 
sen for the consumer preference data. 

0.5, implying that we can be quite confident that the more expensive patty 
offers an improvement in taste. 

To return to the executives' question concerning a substantial improve¬ 
ment in taste, we must first define "substantial." We (somewhat arbitrarily) 
select 0.6 as the critical value which 6 must exceed in order for the improve¬ 
ment to be regarded as "substantial." Given this cutoff value, it is natural 
to compare the hypotheses M\ : 6 > 0.6 and M 2 '■ 0 < 0.6. Using the 
uniform prior, the Bayes factor in favor of M\ is easily computed using 


Prior Posterior quantile 

distribution .025 .500 .975 P(6 > .6|x) 

Beta(. 5, .5) 0.579 0.806 0.944 0.964 

Beta(l, 1) 0.566 0.788 0.932 0.954 

Beta( 2,2) 0.544 0.758 0.909 0.930 

Table 2.1 Posterior summaries, consumer preference data. 
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equation (2.19) and Table 2.1 as 

0.954/0.046 

BF ~ 0.4/0.6 - 3L1 ’ 

fairly strong evidence in favor of a substantial improvement in taste. Jef¬ 
freys suggested judging BF in units of 0.5 on the common-log-odds scale. 
Since log in 31.1 ~ 1.5, this rule also implies a strong preference for Mi. 


2.4 Model assessment 


2.4.1 Diagnostic measures 


Flaving successfully fit a model to a given dataset, the statistician must 
be concerned with whether or not the fit is adequate and the assumptions 
made by the model are justified. For example, in standard linear regression, 
the assumptions of normality, independence, linearity, and homogeneity of 
variance must all be investigated. Several authors have suggested using the 
marginal distribution of the data, m(y), in this regard. Observed Hi values 
for which 777 (t/j) is small are “unlikely” and, therefore, may be considered 
outliers under the assumed model. Too many small values of m( y t ) suggest 
the model itself is inadequate and should be modified or expanded. 

A problem with this approach is the difficulty in defining how small is 
"small," and how many outliers are "too many." In addition, we have the 
problem of the possible impropriety of rn(y) under noninformative pri¬ 
ors, already mentioned in the context of Bayes factors. As such, we might 
instead work with predictive distributions, since they will be proper when¬ 
ever the posterior is. Specifically, suppose we fit our Bayesian model using 
a fitting sample of data z = (Zi,.,z m )' and wish to check it using an 
independent validation sample y = t/i, • •., t/n)'• By analogy with classical 
model checking, we could begin by analyzing the collection of residuals, 
the difference between the observed and fitted values for each point in the 
validation sample. In this context, we define a Bayesian residual as 

n = yi - E(Yi |z), i = 1,..., n. (2.23) 


Plotting these residuals versus fitted values might reveal failure in a nor¬ 
mality or homogeneity of variance assumption; plotting them versus time 
could reveal a failure of independence. Summing their squares or absolute 
values could provide an overall measure of fit, but, to eliminate the depen¬ 
dence on the scale of the data, we might first standardize the residuals, 


computing 


Vi- E(Yj\z) 
^/Var(Yi |z) ’ 


= 1 


(2.24) 


The above discussion assumes the existence of two independent data 
samples, which may well be unavailable in many problems. Of course we 
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could simply agree to always reserve some portion (say, 20-30%) of our 
data at the outset for subsequent model validation, but we may be loathe 
to do this when data are scarce. As such, Gelfand, Dey, and Chang (1992) 
suggest a cross- validatory (or "leave one out") approach, wherein the fitted 
value for is computed conditional on all the data except y i . namely, 
y(a) = (j/l, • • • Vi-l, Vi+i, ■ ■ ■, Vn) 1 - That is > we replace (2.23) by 

r 'i — Vt — £(Ti|y(«)), 

and (2.24) by 

j _ Vi ~ E(Yj |y w ) 
yjV ar(Yi |y(i)) 

Gelfand et al. (1992) provide a collection of alternate definitions for di, 
with associated theoretical support presented in Gelfand and Dey (1994). 

Note that in this cross-validatory approach we compute the posterior 
mean and variance with respect to the conditional predictive distribution, 

f( y i l y (*')) == J f(yi\ d ’y(i))p( d \y(i)) de > (2 - 25 ) 

which gives the likelihood of each point given the remainder of the data. 
The actual values of /(?/; |yq)) (sometimes referred to as the conditional 
predictive ordinate, or CPO) can be plotted versus i as an outlier diagnostic, 
since data values having low CPO are poorly fit by the model. Notice that, 
unlike the marginal distribution, /(t/j |y(q) will be proper if p{0\y(, t ) is. In 
addition, the collection of conditional predictive densities {/(?/,■ lyqj), * = 
1, . . . , n] is equivalent to m(y) when both exist (Besag, 1974), encouraging 
the use of the former even when the latter is undefined. 

An alternative to this cross-validatory approach is due to Rubin (1984), 
who proposes instead posterior predictive model checks. The idea here is 
to construct test statistics or other "discrepancy measures" D(y) that at¬ 
tempt to measure departures of the observed data from the assumed model 
(likelihood and prior distribution). For example, suppose we have fit a nor¬ 
mal distribution to a sample of univariate data, and wish to investigate the 
model's fit in the lower tail. We might compare the observed value of the 
discrepancy measure 

-I7(y) — Vmin 

with its posterior predictive distribution, p(Z)(y*)|y), where y* denotes a 
hypothetical future value of y. If the observed value is extreme relative 
to this reference distribution, doubt is cast on some aspect of the model. 
Gelman et al. (1995) stress that the choice of D(y) can (and should) be 
tuned to the aspect of the model whose fit is in question. However, D( y) 
should not focus on aspects which are parametrized by the model (e.g., 
taking D(y) equal to the sample mean or variance in the above example), 
since these would be fit automatically in the posterior distribution. 
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In order to be computable in the classical framework, test statistics must 
be functions of the observed data alone. But, as pointed out by Gelman, 
Meng, and Stern (1996), basing Bayesian model checking on the posterior 
predictive distribution allows generalized test statistics D{y,9 that de¬ 
pend on the parameters as well as the data. For example, as an omnibus 
goodness-of-fit measure, Gelman et al. (1995, p.172) recommend 


D(y,9) 


^ [ yi - E{Y { |fl)] 2 
ti Var{Yi\9) 


(2.26) 


With 9 varying according to its posterior distribution, we would now com¬ 
pare the distribution of D(y,9 ) for the observed y with that of D(y*, 9) for 
a future observation y* . A convenient summary measure of the extremeness 
of the former with respect to the latter is the tail area, 


Pd = P[D(y*,9)> D(y,9)\y] 

= j P[D(y*, 9) > D( y, 9)\9 ] p(9\y)d9 . (2.27) 

In the case where D(y*, 9 )'s distribution is free of 9 (in frequentist parlance, 
when D(y*,9) is a pivotal quantity), po is exactly equal to the frequentist 
p-value, or the probability of seeing a statistic as extreme as the one actually 
observed. As such, (2.27) is sometimes referred to as the Bayesian p-value. 

Given our earlier rejection of the p-value (in Example 1.2 and elsewhere), 
primarily for its failure to obey the Likelihood Principle, our current in¬ 
terest in a quantity that may be identical to it may seem incongruous. 
But as stressed by Meng (1994), it is important to remember that we are 
not advocating the use of (2.27) in model choice: Pd is not the probability 
that the model is "correct," and Bayesian p-values should not be compared 
across models. Rather, they serve only as measures of discrepancy between 
the assumed model and the observed data, and hence provide information 
concerning model adequacy. Even on this point there is controversy, since, 
strictly speaking, this model checking strategy uses the data twice: once to 
compute the observed statistic D(y, 9), and again to obtain the posterior 
predictive reference distribution. It is for this reason that, when sufficient 
data exist, we prefer the approach that clearly separates the fitting sample 
z from the validation sample y. Then we could find 


p' D = P[D(y*,9) > D(y, 9)\z\ 

= j P[D(y*, 9) > D( y, 9)\9]p(9\z)d9 . (2.28) 

Still, (2.27) could be viewed merely as an approximation of the type found 
in empirical Bayes analyses; see Chapter 3. Moreover, the generalized form 
in (2.27) accommodates nuisance parameters and provides a Bayesian prob- 
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ability interpretation in terms of future datapoints, thus somewhat reha¬ 
bilitating the p-value. 


2.4.2 Model averaging 


The discussion so far in this section suggests that the line between model 
adequacy and model choice is often a blurry one. Indeed, some authors 
(most notably, Gelfand, Dey, and Chang, 1992) have suggested cross- 
validatory model adequacy checks as an informal alternative to the more 
formal approach to model choice offered by Bayes factors and their variants. 
We defer details until Subsection 6.5. 

In some problem settings there may be underlying scientific reasons why 
one model or another should be preferred (for example, a spatial distribu¬ 
tion of observed disease cases, reflecting uncertainty as to the precise num¬ 
ber of disease "clusters" or sources of contamination). But in other cases, 
the process of model selection may be somewhat arbitrary in that a number 
of models may fit equally well and simultaneously provide plausible expla¬ 
nations of the data (for example, a large linear regression problem with 20 
possible predictor variables). In such settings, choosing a single model as 
the “correct” one may actually be suboptimal, since further inference con¬ 
ditional on the chosen model would of necessity ignore the uncertainty in 
the model selection process itself. This in turn would lead to overconfidence 
in our estimates of the parameters of interest. 

In principle, the Bayesian approach can handle this difficulty simply by 
replacing model choice with model averaging. Suppose (Mi,..., M m ) again 
index our collection of candidate models and 7 is a quantity of interest, 
assumed to be well-defined for every model. Given a set of prior model 
probabilities |p(Mi),..., p(M rn )}. the posterior distribution of 7 is 

m 

P(7|y) = ^2p(7\Mi, y)p(Mi\y) , (2.29) 

i~l 

where p('/ |M;, y) is the posterior for 7 under the i th model, and p(M* |y) 
is the posterior probability of this model, computable as 


p(Mi |y) = 


p(y\Mi)p(Mj) 

Y,”Up(y\ M j)p( M j) ’ 


where p(y|M, ) is the marginal distribution of the data under the i th model, 
given previously in equation (2.18). Suppose we rank models based on 
the log of their conditional predictive ordinate evaluated at 7 , namely, 
logp( 7 1Mi, y). Using the nonnegativity of the Kullback-Leibler divergence. 
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Madigan and Raftery (1994) show that for each model i = 1, . . . , m. 


E log 


m \ 


> E[logp(j\Mi,y)] , 


where the expectations are with respect to the joint posterior distribution 
of 7 and the M,-s. That is, averaging over all the models results m a better 
log predictive score than using any one of the models individually. 

Despite this advantage, there are several difficulties in implementing a 
solution via equation (2.29). All of them basically stem from the fact that 
the number of potential models m is often extremely large. For example, in 
our 20 -predictor regression setting, m = 2 20 (since each variable can be in 
or out of the model), and this does not count variants of each model arising 
from outlier deletion or variable transformation. Specifying the prior prob¬ 
abilities for this number of models is probably infeasible unless some fairly 
automatic rule is implemented (say, p(Mi) = 1/m for all i) In addition, 
direct computation of the sum in (2.29) may not be possible, forcing us to 
look for a smaller summation with virtually the same result. One possibility 
(also due to Madigan and Raftery, 1994) is to limit the class of models to 
those which meet some minimum level of posterior support, e.g., the class 


say for C — 20. They further eliminate the set B of models that have 
less posterior support than their own submodels, and then perform the 
summation in (2.29) over the set A = A'\B. Madigan and Raftery (1994) 
give a greedy-search algorithm for locating the elements of A, and offer 
several examples to show that their procedure loses little of the predictive 
power available by using the entire model collection. 

An alternative would be to search the entire sample space using Markov 
chain Monte Carlo methods, locating the high-probability models and in¬ 
cluding them in the summation. Since we have not yet presented these 
methods, it is difficult to say much more about them at this time. Indeed, 
many of the predictive quantities described in this section involve compu¬ 
tation of high-dimensional integrals that are routinely evaluated via Monte 
Carlo methods. As such, we defer further discussion of these issues until 
Chapter 6 , after we have developed the computational machinery to enable 
routine calculation of the residuals and predictive summaries in Chapter 5. 
We do however point out that all of the model averaging approaches dis¬ 
cussed so far assume that the search may be done in "model only" space 
(i.e., the parameters 6 can be integrated out analytically prior to beginning 
the algorithm). For searches in "model-parameter" space, matters are more 
complicated; algorithms for this situation are provided by George and Mc- 
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Culloch (1993), Carlin and Chib (1995), Chib (1995), Green (1995), Phillips 
and Smith (1996), Dellaportas et al. (2000), and Godsill (2000). 

2.5 Nonparametric methods 

The previous section describes how we might mix over various models Mi 
in obtaining a posterior or predictive distribution that more accurately 
reflects the underlying uncertainty in the precise choice of model. In a 
similar spirit, it might be that no single parametric model /(y I fj could fully 
capture the variability inherent in _y's conditional distribution, suggesting 
that we average over a collection of such densities. For example, for the 
sake of convenience we might average over normal densities, 

K 

f(y\n, <t 3 , q) = ^ q k <f>(y\y k , a \) , (2.30) 

k=\ 

where p = (/ij,..., fi K ), <r 2 = (aj,,.., a 2 K ), q = (</i ,...,q K ), and 0(.|.,.) 
denotes the normal density function. Equation (2.30) is usually referred to 

as a mixture density, which has mixing probabilities q k satisfying Ik = 

1. Mixture models are attractive since they can accommodate an arbitrarily 
large range of model anomalies (multiple modes, heavy tails, and so on) 
simply by increasing the number of components in the mixture. They are 
also quite natural if the observed population is more realistically thought 
of as a combination of several distinct subpopulations (say, a collection of 
ethnic groups). An excellent reference on the Bayesian analysis of models 
of this type is the book by Titterington, Makov, and Smith (1985). 

Unfortunately, inference using mixture distributions like (2.30) is often 
difficult. Since increasing one of the al and increasing the number of mix¬ 
ture components K can have a similar effect on the shape of the mixture, 
the parameters of such models are often not fully identified by the data. 
This forces the adoption of certain restrictions, such as requiring al = a 2 
for all k, or fixing the dimension K ahead of time. These restrictions in 
turn lead to difficulties in computing the relevant posterior and predictive 
densities; in this regard see West (1992) and Diebolt and Robert (1994) for 
a list of potential problem areas and associated remedies. 

An appealing alternative to discrete finite mixtures like (2.30) would be 
continuous mixtures of the form 

f{y\g) = J f(y\0)g(d)de . 

But this creates the problem of choosing an appropriate mixing distribution 
g. Of course we could select a member of some parametric family g( 0\if), 
but the appropriate choice might not be apparent before seeing the data. 
As such, we might wish to specify g nonparametrically , in effect, allowing 
the data to suggest the appropriate mixing distribution. 
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The most popular method for doing this is via the Dirichlet process (DP) 
prior, introduced by Ferguson (1973). This prior is a distribution on the set 
of all probability distributions J>. Suppose we are given a base distribution 
Go and a scalar precision parameter v. Then, for any finite partition of 
the parameter space {Bi,..., B m } the DP prior is a random probability 
measure G E V that assigns a Dirichlet distribution with m-dimensional 
parameter vector ( vGq{B \),..., vGo{B m )) to the vector of probabilities 
(G(-Bi),..., G{B m )). (The Dirichlet distribution is a multivariate general¬ 
ization of the beta distribution; see Appendix A.) A surprising consequence 
of this formulation is that the DP assigns probability 1 to the subset of dis¬ 
crete probability measures. We use the standard notation G ~ DP(vGq) 
to denote this Dirichlet process prior for G. 

The use of the Dirichlet process in the context of mixture distributions 
essentially began with Antomak (1974). Here, we work with the set of 
probability distributions p for y having densities of the form 

MG) = j f(y\6)dG(6) , 

where G ~ DP(vG o). Hence T is a nonparametric mixture family, with 
GO ) playing the role of the conditional distribution of 8 given G. Note 
that we may partition 8 as 9\, 82 and DP mix on only the subvector 8\ , 
placing a traditional parametric prior on 6 2 ; many authors refer to such 
models as semiparametric. 

In practice, we will usually have independent data values t/j, i = 1 
Inserting a latent parameter 8i for each yi we obtain the hierarchical model 

Yi\8i fi(m\8i) 

8i\G ~ G(8) (2.31) 

G|Go, v ~ DPivGo ) 

where here we DP mix over the entire 8 vector. 

Computational difficulties precluded the widespread use of DP mixture 
models until very recently, when a series of papers (notably Escobar, 1994, 
and Escobar and West, 1992, 1995) showed how Markov chain Monte Carlo 
methods could be used to obtain the necessary posterior and predictive dis¬ 
tributions. These methods avoid the infinite dimension of G by analytically 
marginalizing over this dimension in the model, obtaining the likelihood 

n 

II/<<*!*> P(0i,---,0»|G O ,«'). (2.32) 

A — 1 

Note that this marginalization implies that the 0, are no longer condi¬ 
tionally independent as in (2.31). Specifying a prior distribution on v and 
the parameters of the base distribution Go completes the Bayesian model 
specification. The basic computational algorithm requires that Go be con- 
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jugate with the f i} but a recent extension due to MacEachern and Miiller 
(1994) eliminates this restriction (though, of course, not without addi¬ 
tional algorithmic complexity). Most recently, Mukhopadhyay and Gelfand 
(1997) combine DP mixtures with overdispersed exponential family models 
(Gelfand and Dalai, 1990), producing an extremely broad extension of the 
generalized linear model setting. 

DP mixing is an extremely recent addition to the applied Bayesian 
toolkit, and many practical issues remain to be addressed. For example, 
note that large values of the DP precision parameter u favor adoption of 
the base measure Go as the posterior for the 0,, while small values will 
produce the empirical cdf of the data as this posterior. As such, the pre¬ 
cise choice of the prior distribution for v (and the sensitivity of the results 
to this decision) is still an area of substantial debate. Other unresolved 
issues include the proper selection of computational algorithm and associ¬ 
ated convergence checks, and the suitability and feasibility of the models 
in real data settings. Still, the substantial current research activity in non- 
parametric Bayesian methods suggests a bright future in practice. 


2.6 Exercises 

1. Let Yi,y 2 .... be a sequence of random variables (not necessarily inde¬ 
pendent) taking on the values 0 and 1. Let S n = Ij + .. ,Y n . 

(a) Represent _P(y n+1 = IIS',, = s) in terms of the joint distribution of 
they. 

(b) Further refine the representation when all permutations of the y such 

that S n = s have equal probability (i.e., the y are exchangeable). 

(c) What is the conditional probability if the y are independent? 

2. Confirm expression (2.2) for the conditional posterior distribution of 9 
given <t 2 in the conjugate normal model, and show that the precision in 
this posterior is the sum of the precisions in the likelihood and the normal 
prior, where precision is defined to be the reciprocal of the variance. 

3. Confirm expression (2.9) for the conditional posterior distribution of 
given 9 in the conjugate normal model. If your random number generator 
could produce only y 2 random variates, how could you obtain samples 
from (2.9)? 

4. Let 9 be the height of the tallest individual currently residing in the city 
where you live. Elicit your own prior density for 9 by 

(a) a histogram approach 

(b) matching an appropriate distributional form from Appendix A. 

What difficulties did you encounter under each approach? 
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5. For each of the following densities from Appendix A, provide a conjugate 
prior distribution for the unknown parameter(s), if one exists: 

(a) X ~ Bin(n,9), n known 

(b) X ~ NegBin(r,0), r known 

(c) X ~ Mult(n, 0), n known 

(d) X ~ G(a,f3), a known 

(e) X ~ G(a,(3), (3 known 

(f) X~ N k (61,17), X known 

(g) X ~ N k (0, £), 6 known 

6 . For the following densities from Appendix A, state whether each rep¬ 
resents a location family, scale family, location-scale family, or none of 
these. For any of the first three situations, suggest a simple noninforma- 
tive prior for the unknown parameter(s): 

(a) X ~ DE(l,a 2 ) 

(b) X ~ Unif(9 -a,0 + a) 

(c) X ~ Logistic(p, 1) 

(d) X ~ G(a,/3), a known 

(e) X ~ G(a,(3), (3 known 

7. Let 9 be a univariate parameter of interest, and let ■y — gtQ) be a 1-1 
transformation. Use (2.10) and (2.11) to show that (2.12) holds, i.e., that 
the Jeffreys prior is invariant under reparametrization. (Hint: What is 
the expectation of the so-called score statistic, <71og/(x|0)/<i0?) 

8 . Suppose that f{x\9, a) is normal with mean 6 and standard deviation <j. 

(a) Suppose that <7 is known, and show that the Jeffreys prior (2.10) for 
6 is given by 

p(0) = 1,0 £$. 

(b) Next, suppose that 9 is known, and show that the Jeffreys prior for 
a is given by 

p(a) = a > 0 . 

<T 

(c) Finally, assume that 0 and a are both unknown, and show that the 
bivariate Jeffreys prior (2.13) is given by 

p(9, cr) = 9 £ Ht, <7 >0 , 

(T z 

a slightly different form than that obtained by simply multiplying the 
two individual Jeffreys priors obtained above. 

9. Show that the Jeffreys prior based on the binomial likelihood f(x\8) = 
(") 9 X (1 - 9) n ~ x is given by the Beta(.h, .5) distribution. 
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10. Suppose that in the situation of Example 2.2 we adopt the noninforma- 
tive prior p{9, a) = A-, 9 E 5ft, <7 > 0 (i.e., p(9, a 2 ) = 6 6 37, cr 2 > 0). 

(a) Show that the marginal posterior for t = y/n{9 — y)/s, where s 2 = 
E "=1 (2/i - 2/) 2 /(« - 1), is proportional to 

[1 + t 2 /(n — l)] _n / 2 , 


a Student's t distribution with (n - 1) degrees of freedom, 

(b) Show that the marginal posterior for cr 2 is proportional to 


(<r 


.2\-[(n-l)/ 2 +t] 


exp 






an IG 


de " s,,y ' 


11. Suppose that Y\9 ~ G(l, 0) (i.e., the exponential distribution with mean 
8), and that 6 ~ IG(a, (3). 


(a) Find the posterior distribution of 0. 

(b) Find the posterior mean and variance of n 

(c) Find the posterior mode of 9. 

(d) Write down two integral equations that could be solved to find the 
95% equal-tail credible interval for 9. 

12. Suppose X \,..., X n NegBin(r. 9), r known, and that 9 is assigned a 
Betaia. 0) prior distribution. 


(a) Find the posterior distribution of 0. 

(b) Suppose further that r = 5, n = 10, and we observe ^ n =1 Xi = 70. 
Of the two hypotheses H\ : 9 < .5 and 77 2 : 9 > .5, which has greater 
posterior probability under the mildly informative Beta(2,2) prior? 

(c) What is the Bayes factor in favor of 777? Does it suggest strong evi¬ 
dence in favor of this hypothesis? 


13. Consider the two-stage binomial problem wherein 9 ~ G, and X{\9 
Bemoulli(O), i = 1, .. ., n. 

(a) Find the joint marginal distribution of X n = (A’i, ..., X n ). 

(b) Are the components independent? In any case, describe features of 
the marginal distribution. 

(c) Let S n = Xi + • • • + X n . Under what conditions does S n have a 
binomial distribution? Describe features of its distribution. 

(d) If dG{9 ) = d9, find the distribution of S n and discuss its features. 

14. For the Dirichlet process prior described in Section 2.5, use the properties 
of the Dirichlet distribution to show that for any measurable set B, 


© 2000 by CRC Press LLC 



(a) E[G(B)] = G tt (B) 

(b) Var[G(B)\ = G 0 (B)(1 - G 0 {B))/{v + 1) 

What might be an appropriate choice of vague prior distribution on vl 
15. The data in Table 2.2 come from a 1980 court case (Reynolds v. Sheet 
Metal Workers) involving alleged applicant discrimination. Perform a 
Bayesian analysis of these data. Is there evidence of discrimination? 



Selected 

Rejected 

Total 

Black 

14 

30 

44 

White 

41 

39 

80 

Total 

55 

69 

124 


Table 2.2 Summary table from Reynolds v. Sheet Metal Workers 

16. The following are increased hours of sleep for 10 patients treated with 
soporific B compared with soporific A (Cushny and Peebles, presented 
by Fisher in Statistical Methods for Research Workers): 

1.2,2.4,1.3,1.3,0.0,1.0,1.8,0.8,4.6,1.4 

Perform a Bayesian analysis of this data and draw conclusions, assuming 
each component of the likelihood to be 

(a) normal 

(b) Student's t with 3 degrees of freedom (f 3 ) 

(c) Cauchy (f i) 

(d) Bernoulli (i.e., positive with some probability 9), dealing with the 0 
somehow (for example, by ignoring it). 

Don't forget to comment on relevant issues in model assessment and 
model selection. 

( Hint: Write a Fortran, C, Matlab, or S-plus program to handle any 
necessary numerical integration and computation of marginal posterior 
probabilities. Subroutine libraries or statistical packages may be helpful 
for the integration or plotting of marginal posterior distributions.) 
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CHAPTER 3 


The empirical Bayes approach 


3.1 Introduction 

As detailed in Chapter 2, in addition to the likelihood, Bayesian analysis 
depends on a prior distribution for the model parameters. This prior can 
be nonparametric or parametric, depending on unknown parameters which 
may in turn be drawn from some second-stage prior. This sequence of 
parameters and priors constitutes a hierarchical model. The hierarchy must 
stop at some point, with all remaining prior parameters assumed known. 
Rather than make this assumption, the empirical Bayes (EB) approach uses 
the observed data to estimate these final stage parameters (or to directly 
estimate the Bayes decision rule) and then proceeds as though the prior 
were known. 

Though in principle EB can be implemented for hierarchical models hav¬ 
ing any number of levels, for simplicity we present the approach for the 
standard two-stage model. That is, we assume a likelihood f( yl 0 ) for the 
observed data y given a vector of unknown parameters 0 and a prior for 
0 with cdf G(0) and corresponding density or mass function g( 0\rj), where 
0 is a vector of hyperparameters. With q known, the Bayesian uses Bayes' 
rule (2.1) to compute the posterior distribution, 

my. n) = , (3.1) 

m(y|rj) 

where m(y |r/) denotes the marginal distribution of y, 


^ylv) = J f(y\O)g{0\q)d0 


or, in more general notation, m G {y) = / f(y\0)dG(0). 

If r) is unknown, the fully Bayesian (Chapter 2) approach would adopt 
a hyperprior distribution, h(q), and compute the posterior distribution as 


K%) 


J f(y\0)g(0\v)h(Ti)dri 

/ f f{y\u)g(u\v)h(r])dudr) 


= J P(9\y,v)h(q\y)dq. 


The second representation shows that the posterior is a mixture of condi- 
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tional posteriors (3.1) given a fixed -q, with mixing via the marginal poste¬ 
rior distribution of q. 

In empirical Bayes analysis, we instead use the marginal distribution 
(3.2) to estimate q by q = q(y) (for example, the marginal maximum 
likelihood estimator). Inference is then based on the estimated posterior 
distribution p(9\y,q). The EB approach thus essentially replaces the in¬ 
tegration in the rightmost part of (3.3) by a maximization, a substantial 
computational simplification. The name "empirical Bayes" arises from the 
fact that we are using the data to estimate the hyperparameter q. Ad¬ 
justments to account for the uncertainty induced by estimating q may be 
required, especially to produce valid interval estimates (see Section 3.5). 

The development in the preceding paragraph actually describes what 
Morris (1983a) refers to as the parametric EB (PEB) approach. That is, 
we assume the distribution g(0\q) takes a parametric form, so that choosing 
a (data-based) value for q is all that is required to completely specify the 
estimated posterior distribution. For more flexibility, one can implement 
a nonparametric EB (NPEB) approach, in which G(0) has an unknown 
form. Pioneered and championed by Robbins (1955, 1983). and further 
generalized and modernized by Maritz and Lwin (1989, Section 3.4). van 
Houwelingen (1977) and others, this method first attempts to represent 
the posterior mean in terms of the marginal distribution, and then uses 
the data to estimate the Bayes rule directly. Recent advances substitute 
a nonparametric estimate of G(Q), generalizing the approach to a broader 
class of models and inferential goals. The nonparametric setup is discussed 
in the next section, with the remainder of the chapter primarily devoted to 
the parametric EB formulation. 

3.2 Nonparametric EB (NPEB) point estimation 

3.2.1 Compound sampling models 

Since, for EB analysis, G(Q) is not completely known and at least some 
of its features must be estimated, repeated draws from G are required to 
obtain sufficient information. Consider then the compound sampling model, 

Yi 1 0i fi(yi | 6i) , and (3.4) 

Oi ~ (?(•), 

where i = 1 , ... ,k. Compound sampling arises in a wide variety of ap¬ 
plications including multi-site clinical trials, estimation of disease rates in 
small geographic areas, longitudinal studies, laboratory assays, and meta- 
analysis. Model (3.4) can be extended by allowing regression and correlation 
structures in G, or allowing the Y{ to be correlated given the 0,. 

Suppose we seek point estimates for the 0 ,, and that the prior cdf G 
has corresponding density function g. Writing Y = (1),..., Yj.) and 6 = 
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(Oi,... ,9k), it is left as an exercise to show that the marginal density of 
the data ra<;(y) satisfies the equation 

k 

m G (y) = ’ ( 3 - 5 ) 

i=l 

where ma(yi) = f fi(yi \ 9i)dG(9i). That is, the Y % are marginally inde¬ 
pendent (and identically distributed as well if /, = / for all i). 


3.2.2 Simple NPEB (Robbins' method) 


Consider the basic compound model (3.4) where G is completely unspeci¬ 
fied, and Yi\9i is Poisson(dj). That is, 

e - e -0 y ‘ 

}{yi I 0i)= -r-, Vi = o, 1,2,.... 

yd 

As shown in Appendix B, the Bayes estimate of 9{ under squared error loss 
is the posterior mean, 

Of = E(9i | y) = E{9i \ Vi ) 
f(u«>+'e-y yi l)dG(u) 
f (uV-e~ u /y,:!)dG(u) 

(yi + 1 )m G (yj + 1) (3 6) 

ma{yi) 

where m,G{yi) is the marginal distribution of y t . Thus, thanks to the special 
structure of this model, we have written the Bayes rule in terms of the data 
and the marginal density, which is directly estimable from the observed 
data. 

Taking advantage of this structure, Robbins (1955) proposed the com¬ 
pletely nonparametric estimate computed by estimating the marginal prob¬ 
abilities by their empirical frequencies, namely. 


aB _ , , , n#(2/ s e 9 ual to Vi + !) 

Oi — {yi + 1)— -7T. -T--\— 

#(ys equal to yi ) 


(3.7) 


This formula exemplifies borrowing information. Since the estimate of the 
marginal density depends on all of the j/;, the estimate for each component 
9i is influenced by data from other components. 

The foregoing is one of several models for which the analyst can make em¬ 
pirical Bayes inferences with no knowledge of or assumptions concerning G. 
Maritz and Lwin (1989) refer to such estimators as simple EB estimators. 
The added flexibility and robustness of such nonparametric procedures is 
very attractive, provided they can be obtained with little loss of efficiency. 

The Robbins estimate was considered a breakthrough, in part because he 
showed that (3.7) is asymptotically optimal in that, as k —► oo, its Bayes risk 
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(see (B.6) in Appendix B) converges to the Bayes risk for the true Bayes 
rule for known G. However, in Section 3.2.3 we show that this estimator 
actually performs poorly, even when k is large. This poor performance can 
be explained in part by the estimator's failure to incorporate constraints 
imposed by the hierarchical structure. For example, though the Bayes es¬ 
timator (3.6) is monotone in y, the Robbins estimate (3.7) need not be. 
Equation (3.6) also imposes certain convexity conditions not accounted for 
by (3.7). Several authors (van Houwelingen, 1977; Maritz and Lwin, 1989, 
Subsection 3.4.5) have developed modifications of the basic estimator that 
attempt to smooth it. One can also generalize the Robbins procedure to 
include models where the sampling distribution f is continuous. 

Large-sample success of the Robbins approach and its refinements de¬ 
pends strongly on the model form and use of the posterior mean (the Bayes 
rule under squared error loss). A far more general and effective approach 
adopts the parametric EB template of first estimating G by G and then 
using (3.6) with G replaced by G. That is, we use m ( j( ) in place ofra G ( 
This approach accommodates general models and loss functions, imposes all 
necessary constraints (such as monotonicity and convexity), and provides 
a unified approach for discrete and continuous f. To be fully nonparamet- 
ric with respect to the prior, the approach requires a fully nonparametric 
estimate of G. Laird (1978) proved the important result that the G which 
maximizes the likelihood (3.5) is a discrete distribution with at most k 
mass points; G is discrete even if G is continuous. We introduce this non¬ 
parametric maximimum likelihood (NPML) estimate in Subsection 3.4.3, 
and consider it in more detail in Section 7.1. 


3.2.3 Example: A ccident data 

We consider counts of accident insurance policies reporting t/i claims during 
a particular year (Table 3.1, taken from Simar, 1976). The data are discrete, 
and the Poisson likelihood with individual accident rates drawn from a prior 
distribution (producing data exhibiting extra-Poisson variation) is a good 
candidate model. 

We compute the empirical Bayes, posterior mean estimates of the rate pa¬ 
rameters 8 { using several priors estimated from the data and the "Robbins 
rule" that directly produces the estimates. The table shows the Robbins 
simple EB estimate, as well as the NPEB estimate obtained by plugging 
the NPML for G into the Bayes rule. As mentioned in Section 3.1, another 
option is to use a parametric form for G, say G (Oi |r/), and estimate only the 
parameter rj. Given our Poisson likelihood, the conjugate G( a, 0 prior is 
the most natural choice Proceeding in the manner of Example 2.5, we can 
then estimate rj = (a,/?) from the. marginal distribution rn(y\a. (3) plug 
them into the formula for the Bayes rule, and produce a parametric EB 
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EG{Qj\yi) 


Vi 

count 

m G (yi ) 

Robbins 

Gamma 

NPML 

0 

7840 

.82867 

.168 

.159 

.168 

1 

1317 

.13920 

.363 

.417 

.372 

2 

239 

.02526 

.527 

.675 

.610 

3 

42 

.00444 

1.333 

.933 

1.001 

4 

14 

.00148 

1.429 

1.191 

1.952 

5 

4 

.00042 

6.000 

1.449 

2.836 

6 

4 

.00042 

1.750 

1.707 

3.123 

7 

1 

.00011 

0.000 

1.965 

3.142 


Table 3.1 Simar (1976) Accident Data: Observed counts and empirical Bayes 
posterior means for each number of claims peryearfork = 9461 policies issued by 
La Royal Beige Insurance Company. The y, are the observed frequencies, Pq is the 
obsen’ed relative frequency, "Robbins" is the Robbins NPEB nile, "Gamma" is the 
PEB posterior mean estimate based on the Poisson/gamma model, and "NPML" 
is the posterior mean estimate based on the EB nde for the nonparametric prior. 


estimate. The specifics of this approach are left as Exercise 9. (The student 
may well wish to study Section 3.3 before attempting this exercise.) 

Table 3.1 reports the empirical relative frequencies for the various ob¬ 
served yi values, as well as the results of modeling the data using the 
Robbins rule, the PEB Poisson/gamma model, and the NPML approach. 
Despite the size of our dataset (k = 9461), the Robbins rule performs errati¬ 
cally for all but the smallest observed values of yi. It imposes no restrictions 
on the estimates, and fails to exploit the fact that the marginal distribu¬ 
tion of y is generated by a two-stage process with a Poisson likelihood. 
By first estimating the prior, either parametrically or nonparametrically, 
constraints imposed by the two-stage process (monotonicity and convexity 
of the posterior mean for the Poisson model) are automatically imposed. 
Furthermore, the analyst is not restricted to use of the posterior mean. 

The marginal method of moments estimated mean and variance for the 
gamma prior are 0.2144 and 0.0160, respectively. Because the gamma dis¬ 
tribution is conjugate, the EB estimate is a weighted average of the data 
y and the prior mean, with a weight of 0.7421 on the prior. For small 
values of y it all three estimates of E(6i\y,) in Table 3.1 are in close agree¬ 
ment. The predicted accident rate for those with no accidents yi =0) is 
approximately 0.16, whereas the estimate based on the MLE is 0, since 
qMLE _ y. T| le Poisson/gamma and NPML methods part company for y 
values greater than 3. Some insight into the plateau in the NPML estimates 
for larger y is provided by the estimated prior g ) displayed in Table 3.2. 
Virtually all of the mass is on "safe" drivers (the first two mass points). 
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mass point, 6 0.089 0.580 3.176 3.669 
mass, g{6) .7600 .2362 .0037 .0002 


Table 3.2 NPML prior estimate for the Simar Accident Data 

As y increases, the posterior loads mass on the highest mass point (3.669), 
but the posterior mean can never go beyond this value. 

Of course, Table 3.1 provides no information on the inferential perfor¬ 
mance of these rules. Maritz and Lwin (1989, p. 86) provide simulation 
comparisons for the Robbins, Poisson/gamma and several "improved Bob¬ 
bins" rules (they do not consider the NPML), showing that the EB ap¬ 
proach is very effective. Tomberlin (1988) uses empirical Bayes to predict 
accident rates cross-classified by gender, age, and rating territory. He shows 
that the approach is superior to maximum likelihood in predicting the next 
year's rates, especially for data cells with a small number of accidents. 


3.3 Parametric EB (PEB) point estimation 


Parametric empirical Bayes (PEB) models use a family of prior distribu¬ 
tions G(9\ri) indexed by a low-dimensional parameter rj. In the context of 
our compound sampling framework (3.4), if r/ were known, this would imply 
that the posterior for 0 i depends on the data only through y, . namely, 


P(6i\yi,0) 


_ fi(yi\S,)g{0i\r]) 


(3.8) 


mi{yi\ri) 

But since t] is unknown, we use the marginal distribution of all the data 
m (y|t?) to compute an estimate r), usually obtained as an MLE or method 
of moments estimate. Plugging this value into (3.8), we get the estimated 
posterior p(9i |y,, r)j. This estimated posterior drives all Bayesian inferences; 
for example, a point estimate 9, is the estimated posterior mean (or mode, 
or median). Note that (9, : depends on all the data through ri = fj( y) Morris 
(1983a) and Casella (1985) provide excellent introductions to PEB analysis. 


3.3.1 Gaussian/Gaussian m odels 

We now consider the two-stage Gaussian/Gaussian model, 

Yi\6i N(9i,a 2 ), i= and (3.9) 

6i\p W N(fi, r 2 ), i=l,...,k. 

First, we assume both r 2 and <j 2 are known, so that ij = fi. Calculations 
similar to those for the posterior density in Example 2.1 show that the Yi 
are marginally independent and identically distributed as N(p, cr 2 -f r 2 ) 
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random variables. Hence (3.5) implies that 


*(y|A<) 


1 


[ 2 tt (< 7 2 + r 2 )]"/ 2 


exp 


2(cr 2 + r 2 ) 


- /*) s 


* = 1 


so that y = y = £ y,- is the marginal MLE of y. Thus the estimated 

posterior distribution is 

P(0i\yi,fi) = N(By + (1 - B)yi , (1 - S)(7 2 ) , 
where B = <r 2 /(<r 2 + r 2 ). This produces the PEB point estimate of (), t . 

Bt = By + (l-B) yi (3.10) 

= V+ (1 - B)(yi - y) . (3.11) 

Notice that this formula is identical to that derived in Example 2.1, except 
that a known prior mean (y) is replaced by the sample mean using all of 
the data. Inference on a single component (a single 0i depends on data 
from all the components (borrowing information). Formula (3.10) shows 
the estimate to be a weighted average of the estimated prior mean (y) and 
the standard estimate (r/ t ), while (3.11) shows that the standard estimate 
is shrunk back toward the common mean a fraction of the distance between 
the two. The MLE is equivalent to setting B = 0, while complete pooling 
results from setting B = 1. 

As in the Robbins example, the PEB estimates borrow information from 
all the coordinates in making inference about a single coordinate. In a 
slightly more general setting below, we shall show that these PEB estimates 
outperform the usual MLE when the two-stage model holds. This result is 
not surprising, for we are taking advantage of the connection between the 
k coordinate-specific 0s provided by G. What is surprising is that estimates 
such as those above can outperform the MLE even when G is misspecified, 
or even when the 9s are not sampled from a prior. 

We now generalize the above by retaining (3.9), but assuming that r 
is also unknown. The marginal likelihood must now provide information 
on T) = (y, r). We are immediately confronted with a decision on what 
estimate to use for r (or r 2 or B). If we use the marginal MLE (MMLE) 
for r 2 , then the invariance property of MLEs implies that the MMLE for 
B results from plugging f 2 into its formula. For model (3.9), it is easy to 
show that the MMLE for y is again y, while the MMLE for r 2 is 


f 2 = (s 2 — <r 2 )+ = max{0, 


2 }, 


(3.12) 


where s 2 = i ~ V) 2 ■ This "positive part" estimator (3.12) reports 

the variation in the data over and above that expected if all the 6s are 
equal (i.e., a' 2 ). Transforming (3.12), we find that the MMLE for B is 


B 


a 2 + f l 


a 2 + (s 2 - cr 2 )+ 


(3.13) 
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Clearly 0 < B < 1; this would not have been the case had we used the 
unbiased estimate of r 2 which divides Yli-iiVi ~ V) 2 by k - 1 instead ol 
k and does not take the positive part. Since we know the target quantity 
B must lie in [0, 1], this provides another illustration of the problems with 
unbiased estimates. 

Substituting (3.13) for B in (3.11) produces the EB estimates 


= y + (l-B)( yi -y). 


(3.14) 


As in the case where r is known, the target for shrinkage is y, but now 
the amount of shrinkage is controlled by the estimated heterogeneity in the 
data. The more the heterogeneity exceeds cr . the smaller B becomes, and 
the less the shrinkage. On the other hand, if the estimated heterogeneity 
is less than cr 2 , 5 = 1 and all estimates equal y (i.e., the data are pooled). 
As heterogeneity increases, the weight on the estimated coordinate-specific 
estimates increases to 1. 

As discussed by Laird and Louis (1987), representation (3.11) shows a 
direct analogy with regression to the mean. In the model, Yi and Oi have 
correlation (the infra-class correlation) equal to (1 - 5), and an estimate 
for Oi based on )(■ regresses back toward the common mean. In the usual 
regression model, we estimate the means and correlations from a data set 
where both the Ys and Os (the response variables) are observed. In Bayes 
and empirical Bayes analysis, a distribution on the Os provides sufficient 
information for inference even when the Os are not observed. 

Example 3.1 We apply our Gaussian/Gaussian EB model (3.9) to the 
data in Table 3.3. Originally analyzed in Louis (1984), these data come 
from the Hypertension Detection and Lollow-up Program (HDLP). This 
study randomized between two intervention strategies: referred care (RC), 
in which participants were told they had hypertension (elevated blood pres¬ 
sure) and sent back to their physician, and stepped care (SC), in which par¬ 
ticipants were entered into an intensive and hierarchical series of lifestyle 
and treatment interventions. The data values Y are MLE estimates of 1000 
times the log-odds ratios in comparing five-year death rates between the SC 
and RC regimens in twelve strata defined by initial diastolic blood pressure 
(1 = 90-104, II = 105-114, 111 = 115+), race (B/W), and gender (L/M). 

In a stratum with five-year death rates r sc and r rc , we have 



These estimated log-relative risks have estimated sampling standard devi¬ 
ations which we denote by &. These vary from stratum to stratum, and 
control the amount of shrinkage in a generalization of the empirical Bayes 
rule (3.14) wherein B is replaced by B,: = erf/(ef? + f 2 ). In this unequal 
sampling variance situation, MMLEs for y and r 2 require an iterative al¬ 
gorithm (see Subsection 3.4). 
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Group 

Y 

0^ 

D 

(7 

PSD 

Qcon 

A 

I 

BM 

-129 

-157 

54 

170 

125 

-149 

81 


BF 

-304 

-240 

44 

206 

137 

-285 

74 


WM 

-242 

-220 

59 

153 

117 

-240 

84 


WF 

-355 

-253 

39 

231 

144 

-316 

69 

II 

BM 

-274 

-213 

29 

290 

155 

-255 

59 


BF 

-529 

-266 

23 

337 

161 

-383 

51 


WM 

-41 

-156 

22 

349 

162 

-136 

50 


WF 

809 

-61 

13 

479 

171 

127 

34 

III 

BM 

-558 

-273 

23 

337 

161 

-398 

51 


BF 

-235 

-197 

18 

389 

166 

-231 

44 


WM 

336 

-122 

13 

483 

171 

-38 

34 


WF 

1251 

1 

O 

GO 

6 

730 

178 

43 

18 


Table 3.3 Analysis of 1000 x log(odds ratios) for five-year survival in the HDFP 
study: (i = -188, f 2 = ( 183) 2 D = 100(1 - B), and PSD = dD%/ 10 , the 
posterior standard deviation, t) MT =-188 + 2~\ Y 4- 188), the usual PEB point 
estimate, while 9 con =-229 ■—( Y + 229, the Louis (1984) constrained EB 

estimate. 


Table 3.3 displays both the EB posterior means, 0 ,lT and a set of con¬ 
strained EB estimates, 6 con . Like the usual EB estimate, the constrained 
estimate is a weighted average of the estimated prior mean and the data, 
but with an adjusted prior mean estimate (here, -229 instead of -188) 
and a different weight on the data (A/100). Constrained estimates are de¬ 
signed to minimize SEL subject to the constraint that their sample mean 
and variance are good estimates of their underlying empirical distribution 
function of the 6s. We defer the derivation and specifics of constrained EB 
estimation and histogram estimates more generally to Section 7.1. 

Notice from Table 3.3 and Figure 3.1 that the subgroup estimates with 
high variance for the MLE tend to produce the most extreme observed es¬ 
timates and are shrunken most radically toward the estimated population 
mean (i.e., the value of 1- B is relatively small). The PSD column reports 
"naive" posterior standard errors that have not been adjusted for having 
estimated parameters of the prior. Notice that these are far more homo¬ 
geneous than the a values; highly variable estimates have been stabilized. 
However, as we shall show in Section 3.5, these values need to be increased 
to account for having estimated the prior parameters. This increase will be 
greater for subgroups with a smaller value of 1 - B and for those further 
from ft. 

Typical relations between observed data and empirical Bayes estimates 
are shown in Figure 3.1, which plots the raw Y values on the center axis, 
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Figure 3.1 HDFP raw data and EB estimates. 


connected by solid lines to the corresponding EB estimates 0 ,lT on the up¬ 
per axis. The most extreme estimates are shrunk the most severely, and 
rankings of the estimates can change (i.e., the lines can cross) due to the 
unequal sampling variances of the Y s. Efron and Morris (1977) give an ex¬ 
ample of this phenomenon related to geographic distribution of disease. The 
constrained EB estimates 8 con plotted on the figure's lower axis, shrink ob¬ 
served estimates less than the posterior means, and produce ranks different 
from either the raw data or the standard EB estimates. ■ 

The Gaussian/Gaussian model exhibits the more general phenomenon 
that EB estimates provide a compromise between pooling all data (and thus 
decreasing variance while increasing the bias of each coordinate-specific 
estimate) and using only coordinate-specific data for the coordinate-specific 
estimates (and increasing variance while decreasing bias). Complete pooling 
occurs when B = 1, i.e., when the chi-square statistic for testing equality 
of the k means is less than or equal to k/(k - 1) or large k and t = 0, 
the probability of complete pooling (equivalent to accepting the hypothesis 
that the coordinate-specific means are equal) is approximately .50. Thus, 
the test has type I error of .50, but this apparent liberality is compensated 
for by the smooth departure from complete pooling. 

Proper selection of hyperparameter estimates is one inherent difficulty 
with the PEB approach. We have already proposed that using the positive 
part estimator for B in (3.13) is more sensible than using an unbiased esti- 


© 2000 by CRC Press LLC 






mate, but many other options for B are possible, and it may be difficult to 
guess which will be best in a given situation. By contrast, the fully Bayesian 
model that adds another level to the distributional hierarchy (i.e. includ¬ 
ing a hyperprior h(p, r 2 )) replaces estimation by integration, thus avoiding 
the problem of selecting an estimation criterion. This solution also deals 
effectively with the non-identically distributed case (for example, when the 
sampling variances are different), and automatically injects uncertainty in¬ 
duced by estimating the hyperparameters into the posterior distribution. 
However, this does require selection of an effective hyperprior, an issue we 
shall revisit in Subsection 3.5.2 and the subsequent Example 3.3. 


3.3.2 Beta/binomial model 

As a second example, consider the EB approach to estimating the success 
probability in a binomial distribution. With X the number of events in n 
trials and 0 the event probability, the sampling distribution is 

P(X = x | 9) =, f(x | 9) = 1 - 9) n - x . 

For the empirical Bayesian analysis we follow Subsection 2.3.4 and use the 
conjugate Beta{a, b ) prior distribution. For convenience, we reparametrize 
from (a, b) to (p M) where p = <i/(a + b), the prior mean, and M =a + b, 
a measure of prior precision (i.e., increasing M implies decreasing prior 
variance). Specifically, V(9 \ p, M) — fi{ 1 — p)/(M + 1). 

The marginal distribution of X is beta-binomial with density 

P(X = x \ p, M) = 

r(M) ( n \ r(x+Mv,)T(n-x+M(\-u)) (3.15) 

r(M/i)r(M(t-,u)) \ X ) r(n+M) • 

Notice that if//, = .5 and M = 2, 0 is uniformly distributed a priori, and the 
marginal distribution is P(X = x \ p, M) = l/(n + 1) a discrete uniform 
distribution on the integers 0, 1 , . . ., n. 

In the EB context we have a two-stage, compound sampling model where 
Oi ~ Beta(p, M), and Xi\9i ~ Bin(rii , #;). The MMLE for p, M), or 
equivalently (a, b), is found by subscripting n and x by i and maximizing 
the product of (3.15) over i. This maximization requires evaluation of the 
gamma and digamma functions. A method of moments estimate, while not 
as efficient as the MMLE, is revealing. Using the iterated expectation and 
variance formulae, and denoting the prior mean and variance of by p and 
r 2 we have 


E 



= E 


E 




= E[9i\ = p 
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V 
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Xi 

Tli 

Oiil-Oi) 
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1 / J 


L \ n i 1 / J 


+ ^ [6i\ 


A*(l - A») , 2 n i - 1 
-h r - 

n,* Tii 

Ml ~ M [ 1 n.j - 1 

n, M + 1 


The pooled mean, A' + /n + is a natural method of moments estimate for p, 
but with unequal n;, there are a variety of method of moments estimates 
for ft and M (see Louis and DerSimonian, 1982). If all the n,- are equal, 
matching s 2 = j — /i) 2 and the marginal variance produces 


M 1 - ft) - s 2 

2 _ A(i-A) 


(3.16) 


Notice that M decreases to 0 as s 2 increases to its maximum value of 
Ml — £)i an( l increases to infinity as s 2 decreases to its binomial distribution 
value. Ml - £)/«. 

The estimates fi and M can now be used in a plug-in EB approach. The 
estimated conditional distribution of 9, given Xi is Beta(Xi + Mfi, Hi — 
Xi + M(1 — fi)). The estimated posterior mean is then 




which is the posterior mean with "hatted" values for the prior parameters. 
Hence 0f' A 1 is a weighted average of the prior mean and the MLE, Xi /n,, 
with weight depending on the relative size of M and n t . Increasing M 
(decreasing s 2 ) puts more weight on the prior mean. 

The estimated posterior variance is [p,( 1 - fii)}/{M + n t ). As we warned 
for the Gaussian/Gaussian model, confidence intervals based on this vari¬ 
ance or on the full estimated posterior will be liberal since they do not 
account for uncertainty in estimating the prior parameters. 

For an example of applying the EB approach to inference for binomial 
distributions, see Tarone (1982) and Tamura and Young (1986). A fully 
Bayesian alternative is provided by Dempster et al. (1983). 


„ Mfi + Xi 

= fii = - 

M + Tli 

M n Xi 

— — - ft + —-, 

M -f Tli M + Tli n i 
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3.3.3 EB Performance of the PEB 


To see how well the empirical Bayes rule works when model (3.9) holds, we 
can compute the preposterior (Bayes) risk (B.6) for a. single component 
or for the average over all components. Since the Y i are i.i.d., the two 
computations will be equal and we take advantage of this equality. 

Calculating the preposterior risk requires integration over both the data 
and parameters, and we may integrate in either order. As shown in Ap¬ 
pendix B, integrating only over the distribution of the parameters condi¬ 
tional on the data gives the posterior risk, while integrating only over the 
sampling distribution of the data conditional on the parameters gives the 
frequentist risk. Each is revealing. First, we consider the posterior risk for 
the i th component of the EB estimator (3.14). Under squared error loss , 
the usual posterior "variance plus squared bias" formula gives 

(1 - B)a 2 + [By + (1 - B) Vi -By-( 1 - B) Vi ] 2 . 

Regrouping terms gives 

(l-B) ( r 2 +B 2 (y-y) 2 + (B-B) 2 (y i -y) 2 ~2B(B-B)(y i -y)(y-y). (3-17) 

The first term is the usual Bayes posterior risk. The second term accounts 
for estimating y the third for estimating B, and the fourth for correlation 
in these estimates. Notice that the last two depend on how far y i is from 
y, and that the second and third are very similar to those used to compute 
confidence intervals for regression lines (see, e.g., Snedecor and Cochran, 
1980, Section 9.9). 

To get the preposterior risk we take the expectation of (3.17) with respect 
to the marginal distribution of Y. The computation is far easier if we first 
average over all coordinates, to obtain 


T E ~ *D 2 I y) = (1 - B)a 2 + B 2 (y - y) 2 + (B- B) 2 SS y , (3.18) 


where SS y = Y!i=i{Vi ~ V? since the last term in (3.17) sums to 0. 

The marginal distribution of each Y) is Gaussian with mean y and vari¬ 
ance a 2 + t 2 = cr 2 / B. For ease of calculation we take the estimate of B to 
be of the form 


B 


ccr 

SS V 


(3.19) 


(i.e., we do not use a "positive part," as in (3.13)). We then expand the 
third term in (3.18) and take expectations using properties of the gamma 
distribution to obtain the preposterior risk. 


r(G, = (1 - B)a 2 + ^ -2c+k-l^j (3.20) 

provided that k > 4. Again, the first term is the usual Bayes risk, the 
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second accounts for estimating fi , and the third accounts for estimating B. 
The unbiased estimate of B uses c — k - 3, giving 2Ba 2 /k for the last 
term. This is also its minimum value for this restricted class of estimators, 
but overall performance can be improved by improving the estimate of B 
(for example, by truncating it at 1). 

Formula (3.20) shows the Bayes/EB advantage, when the compound 
sampling model holds. The usual MLE (0 — y { ) has preposterior risk a 2 , 
so for c — k - 3, the savings in risk is 



which can be substantial. Recall that this savings applies to each coordinate 
6i , though we have averaged over coordinates in our calculations. 


3.3.4 Stein estimation 

The above results show that there is great potential to improve on the per¬ 
formance of the standard maximum likelihood estimate when the two-stage 
structure holds and the assumed form of the prior distribution is correct. 
The prior provides a connection among the coordinate-specific parameters. 

Now consider the other extreme where there is no prior distribution: we 
assume only that Y{\0{ N(0i,cr 2 ), i k, where <r 2 is assumed 

known. We investigate the purely frequentist properties of the EB pro¬ 
cedure for the fixed vector G = (0 lt . . ., 0k) 1 - For there to be any hope 

of improvement over the maximum likelihood estimate Q (Y) = Y, we 
must provide a connection among the coordinates, and, without a prior, we 
do this on the "other end" by assessing summed squared error doss (SSEL), 

1 ^ 

SSEL(Y, 6) = - ^(<?,-(Y) — 6i) 2 . (3.21) 

i — 1 

Hence, while the parameters are unconnected in the model, our inferences 
regarding them are connected through the loss function. Note that for con¬ 
venience, we are actually computing averaged squared error loss, but will 
still refer to it as SSEL. 

This loss structure is "societal" in that it combines performance over all 
coordinates and does not pay specific attention to a single coordinate as 
would, say, the maximum of the coordinate-specific losses. The results that 
follow are dependent on a loss function that averages over coordinates; the 
data analyst must decide if such a loss function is relevant. 

Consider the frequentist risk (equation (B.4) in Appendix B) for SSEL, 

R(G,6(Y)) = E(SSEL( Y, 6) \ G) , (3.22) 
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and our Gaussian sampling distribution. It is easy to see that 


R{e,e MLE ( y)) = ±J>((*-0,-) 2 |0) 



2 = 1 


which, of course, is constant regardless of the true value of 9. Stein (1955) 

« MLE 

proved the remarkable result that for k > 3. 9 (Y) is inadmissible as 

an estimator of 9. That is, there exists another estimator with frequentist 
risk no larger than rr 2 for every possible 9 value. One such dominating 
estimator was derived by James and Stein (1961) as 


f)/ s (Y) = 


(k — 2 )< t 2 
l|Y|| 2 . 


v; 


where ||Y|| 2 = yi 2 . These authors showed that 


(3.23) 


R 






k -2 
k 


E Y\9 


{(k-iy y 
V l|Y|| 2 )\ 


(3.24) 


which is less than cr 2 provided k > 3. The subscript on the expectation 
in (3.24) indicates that it is with respect to the full sampling distribution 
/(Y|0), but it actually only requires knowledge of the simpler, univariate 
distribution of ||Y|| given ||0||. 

The risk improvement in (3.24) is often referred to as "the Stein effect." 
While it is indeed surprising given the mutual independence of the k com¬ 
ponents of Y it does depend strongly on a loss function that combines over 
these components. The reader may also wonder about its connection to em¬ 
pirical Bayes analysis. This connection was elucidated in a series of papers 
by Efron and Morris (1971, 1972a,b, 1973a,b, 1975, 1977). Among many 

- JS 

other things, these authors showed that 9 is exactly the PEB point esti¬ 
mator in the Gaussian/Gaussian model (3.9) that assumes p to be known 
and equal to 0, and estimates B by 


B = 


k- 2 

PiF 


(3.25) 


an unbiased estimator of B (the proof of this fact is left as an exercise). As 
suggested by the previous subsection, EB performance of the James-Stein 
estimator can be improved by using a positive part estimator of B. leading 
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to 


ef s (y) 


1 — min 


t (k — 2 )<T 2 

V IIYIP 



so that estimates of 9i are never shrunk past the prior mean, 0. (Notice 
that this improvement implies that the original James-Stein estimator is 
itself inadmissible!) Another way to generalize §f s (Y) is to shrink toward 
prior means other than p = 0, or toward the sample mean in the absence 
of a reliable prior estimate for p. This latter estimator is given by 


§f s '(Y) = Y + 



(k - 3)<r 2 ' 
||Y-y|| 2 . 


(Yi-Y). 


Note that this estimator is a particular form of our Gaussian/Gaussian EB 
point estimator (3.14). For SSEL, one can show that 


JS' 


R[6,9 (Y) =<r 


i _ k _ 3 / k 3 

k e y\Q ( ||Y-y|P 


— ~ JS' 

where the expectation depends only on ||0-6>|| 2 . Hence 6 (Y) dominates 

the MLE Y provided k > 4, a slightly larger dataset being required to "pay" 
for our estimation of p. 

The basic result (3.24) can be proven directly (see, e.g., Cox and Hinkley, 
p. 446-8), or indirectly by showing that the expectation of (3.24) when the 

Oi are i.i.d. from a iV(0, r 2 ) prior matches the preposterior risk of 0 under 
this prior for all r. But these expectations depend only on ||0|| 2 , which is 
distributed as a multiple of a central chi-square distribution. Completeness 
of the chi-square distribution dictates that, almost surely, there can be 
only one solution to this relation, and the Stein result follows. We leave the 
details as an exercise. 

Figure 3.2 plots the frequentist risk of the James-Stein estimator when 
cr 2 = 1 and k = 5. As ||0|| 2 increases, the risk improvement relative to 
the MLE decreases and eventually disappears, but performance is never 
worse than the MLE Y. Since Y is also the minimax rule, the James-Stein 
estimator is subminimax for k > 3 (see Subsection B.1.4 in Appendix B). 


Coordinate-specific vs. coordinate average loss 


The PEB results in Subsection 3.3.3 show how effective the empirical Bayes 
approach can be in estimating individual parameters when the assumed 
compound structure holds. The Stein results in Subsection 3.3.4 show that 
the EB estimators perform well for all 0 vectors, when evaluated under 
SSEL. However, the coordinate-specific risk of these estimators can be high 
if a relatively small number of the 9s are outliers (an unlikely event if the 
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Figure 3.2 Frequentist risk for Stein's mle with k = 5 


9i are sampled from a Gaussian prior). In fact, one can show that 


sup max R(9i,9: 

Q i<i<k v ’ ' 


JS 


) = 


ka 2 


(3.26) 


where R(0,. f?j /s ) is defined as a single component of the sum in (3.24). The 

supremum is attained by setting 9\ = 9? = • • • = = 0 and letting 

6k oo at rate \fk, (see the exercises). In this limiting case, the average 

2 

expected squared error loss is , a substantial improvement over the MLE. 
But, for the MLE, the supremum is <r 2 and Stein's rule can produce a 
substantial coordinate-specific penalty in this extreme situation. This poor 
performance for outlying parameters has motivated limited translation rules 
(Efron and Morris, 1972a), which are compromises between MLEs and 
Stein rules that limit componentwise risk. Selection of priors and likelihoods 
which are more robust against outliers is another possible remedy, and one 
we shall return to in Chapter 6. 

In most applications, even if the assumed prior does not hold, the co¬ 
ordinate parameters are not so completely unrelated as in the situation 
producing (3.26), and so, when carefully applied, the empirical Bayes ap¬ 
proach is effective even for coordinate-specific evaluations. Many empirical 
case studies support this point; for a sampler of applications, see Rubin 
(1980) on law school admissions, Tomberlin (1988) on insurance rating, 
Efron and Morris (1973b, 1977) on batting averages and toxoplasmosis, 
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Devine et al. (1994, 1996) on disease mapping and accident mapping, van 
Houwelingen and Thorogood (1995) on kidney graft survival modeling, and 
Zaslavsky (1993) on the census undercount. 

In other applications, SSEL may be the most reasonable performance 
benchmark. For example, Hui and Berger (1983) use the approach to sta¬ 
bilize residual variance estimates based on a small number of degrees of 
freedom. The estimated variances are then used to produce weights in a 
subsequent analysis. Overall performance is important here, with little in¬ 
terest in individual variance estimates. 


3.4 Computation via the EM algorithm 

For the two-stage model, EB analysis requires maximization of the marginal 
likelihood given in (3.2). For many models, standard iterative maximum 
likelihood methods can be used directly to produce the MMLE and the ob¬ 
served information matrix. In more complicated settings, the EM algorithm 
(where EM stands for "expectation-maximization") offers an alternative 
approach, as we now describe. 

Versions of the EM algorithm have been used for decades. Dempster, 
Laird, and Rubin (1977) consolidated the theory and provided instructive 
examples. The EM algorithm is attractive when the function to be max¬ 
imized can be represented as a missing data likelihood, as is the case for 
(3.2). It converts such a maximum likelihood situation into one with a 
"pseudo-complete" log-likelihood or score function, produces the MLE for 
this case, and continues recursively until convergence. The approach is most 
effective when finding the complete data MLE is relatively straightforward. 
Several authors have shown how to compute the observed information ma¬ 
trix within the EM structure, or provided various extensions intended to 
accelerate its convergence or ease its implementation. 

Since the marginal likelihood (3.2) can be represented as a missing data 
likelihood, the EM algorithm is effective. It also provides a conceptual 
introduction to the more advanced Monte Carlo methods for producing 
the full posterior distribution presented in Section 5.4. Here we sketch the 
approach and provide basic examples. 

3.4.1 EM for PEB 

Consider a model where, if we observed 0, the MLE for rj would be relatively 
straightforward. For example, in the compound sampling model (3.4), given 
0 the MLE for rj can be computed using only the prior g. Suppressing the 
dependency on y on the left-hand side of equations, let 

r\ 

S(0\V) = ^log (</(%)) (3-27) 


© 2000 by CRC Press LLC 



be the score function. For the "E-step," let denote the current estimate 
of the hyperparameter at iteration j, and compute 

S(rj\r] iJ} ) = E(S(6\q) |y, ?? (j) ) . 

This step involves standard Bayesian calculations to arrive at the condi¬ 
tional expectation of the sufficient statistics for r] based on 6. 

The "M-step" then uses S in (3.28) to compute a new estimate of the 
hyperparameter, and the recursion proceeds until convergence. That is, 

^(i+i) _ ar g max | ??(•')) , j = 1,2,... (3.29) 

n 

If both the E and M steps are relatively straightforward (the most attrac¬ 
tive case being when the conditional distributions in (3.28) and (3.29) are 
exponential families), the EM approach works very well. 

Dempster et al. (1977), Meilijson (1989), Tanner (1993, p.43), and many 
other researchers have shown that, at every iteration, the marginal likeli¬ 
hood either increases or stays constant, i.e., 

m(y | r/^ +1 )) > m( y | for all j , 


and thus the algorithm converges monotonically. This convergence could be 
to a local (rather than global) maximum, however, and so, as with many 
optimization algorithms, multiple starting points are recommended. 

Example 3.2 Consider the Gaussian/Gaussian model (3.9) with unequal 
sampling variances, and let q = (//, T), where for notational convenience 
we use T for r 2 . For model component i, -2 times the loglikelihood for q 
as a function of 0, is log(T) + (0* - /u) 2 )/T, producing the score vector 


s(0.-to) = - 



(3.30) 


Expanding the second component gives 


1 Of -2 9m + y? 
T ~ T 2 


The MLE for q depends on the sufficient statistics JT 0,- and JT Of ■ The EM 
proceeds by computing S as in (3.28) using the conditional posterior dis¬ 
tribution p(0, |y,, i> <J T ( C) : substituting the posterior mean for 9 t and the 
posterior variance plus the square of the mean for Of, solving the "psuedo- 
complete"-data MLE equations, and continuing the recursion. Note that 
the algorithm requires expected sufficient statistics, so Of must be replaced 
by its expected value, not by the square of the expectation of 0 t (as though 
the imputed 0s were observed). ■ 


For this model and generalizations, the EM partitions the likelihood max¬ 
imization problem into components that are relatively easy to handle. For 
example, unequal sampling variances produce no additional difficulty. The 
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EM is most attractive when both the E and the M steps are relatively more 
straightforward than dealing directly with the marginal likelihood. How¬ 
ever, even when the E and M steps are non-trivial, they may be handled 
numerically, resulting in potentially simplified or stabilized computations. 


3.4.2 Computing the observed information 

Though the EM algorithm can be very effective for finding the MMLE, 
it never operates directly on the marginal likelihood or score function. 
It therefore does not provide a direct estimate of the observed informa¬ 
tion needed to produce the asymptotic variance of the MMLE. Several 
approaches to estimating observed information have been proposed; all are 
based on the standard decomposition of a variance into an expected con¬ 
ditional variance plus the variance of a conditional expectation. 

Using the notation for hierarchical models and suppressing dependence 
on t], the decomposition of the Fisher information for r) is 

Id - E (Iq \y + Iy ) ■ (3-3!) 

where I is the information that would result from direct observation of 8. 
The first term on the right is the amount of information on 8 provided 
by the conditional distribution of the complete data given y, while the 
second term is the information on T) in the marginal likelihood (the required 
information for the MMLE). 

Louis (1982) estimates observed information versions of the left-hand 
side and the first right-hand term in (3.31), obtaining the second term by 
subtraction. Meng and Rubin (1991) generalize the approach, producing 
the Supplemented EM (SEM) algorithm. It requires only the EM computer 
code and a subroutine to compute the left-hand side. Meilijson (1989) shows 
how to take advantage of the special case where y is comprised of inde¬ 
pendent components to produce an especially straightforward computation. 
With i indexing components, this approach uses (3.28) for each component, 
producing S<, and then 


f Y =Y. S ^i ' (3 ' 32) 

i 

requiring no additional computational burden. This gradient approach can 
be applied to (3.30) in the Gaussian example. With unequal sampling vari¬ 
ances (of), validity of the approach requires either that they are sampled 
from a distribution (a "supermodel"), producing an i.i.d. marginal model, 
or weaker conditions ensuring large sample convergence. Otherwise, the 
more complicated approaches of Louis (1982) or Meng and Rubin (1991) 
must be used. 
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3.4.3 EM for NPEB 


The EM algorithm is also ideally suited for computing the nonparamet- 
ric maximum likelihood (NPML) estimate, introduced in Subsection 3.2 
above. To initiate the EM algorithm, recall that the prior G{9) which 
maximizes the likelihood (3.5) in the nonparametric setting must be a dis¬ 
crete distribution with at most k mass points. Thus, assume the prior has 
mass points = (pt ^,..., /Xj 0i ) and probabilities 7 r(°) = ( 7 Tj 0) ,..., tij 01 ), 
where J =k. Then, for the -f l) 5t iteration, let 


w. 


(" + !) _ 

ij 

.(«'+!) 




(i/+i) 

- 11 1' 


= w 


+j 


and let maximize JT w % * l°g( /« (IPj) - Note that the updated 7 rs 

and fis maximize weighted likelihoods, and that the updated ps are not 
necessarily equal to the previous values. The ws are particularly straight¬ 
forward to compute using Bayes' theorem: 


w 


OH-i) 






Normalization requires dividing by the sum over index j. 
For the Poisson sampling distribution, we have 


V' (v) 

,.(^+D Vi 

~ (") ’ 

Li «; 0 - 

a weighted average of the y,. 

The fi'p should be spread out to extend somewhat beyond the range 
of the data. The number of mass points in the NPML is generally much 
smaller than k. Therefore, though the number of mass points will remain 
constant at J , some combination of a subset of the f 1 ’) approaching, 0 and 
the fjy) approaching each other will occur as the EM iterations proceed. 
The EM algorithm has no problem dealing with this convergence to a ridge 
in the parameter space, though convergence will be very slow. 


3.4.4 Speeding convergence and generalizations 

Several authors (see, e.g.. Tanner, 1993, pp. 55-57) provide methods for 
speeding convergence of the EM algorithm using the information decom¬ 
position (3.31) to compute an acceleration matrix. Though the approach is 
successful in many applications, its use moves away from the basic attrac¬ 
tion of the EM algorithm: its ability to decompose a complicated likelihood 
maximization into relatively straightforward components. 

Several other extensions to the EM algorithm further enhance its effec¬ 
tiveness. Meng and Rubin (1993) describe an algorithm designed to avoid 
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the nested EM iterations required when the M-step is not available in closed 
form. Called the Expectation/Conditional Maximization (ECM) algorithm, 
it replaces the M-step with a set of conditional maximization steps. Meng 
and Rubin (1992) review this extension and others, including the Supple¬ 
mented ECM (SECM) algorithm, which supplements the ECM algorithm 
to estimate the asymptotic variance matrix of the MMLE in the same way 
that the SEM algorithm supplements EM. 

Finally, the EM algorithm also provides a starting point for Monte Carlo 
methods. For example, if the E-step cannot be computed analytically but 0 
values can be sampled from the appropriate conditional distribution, then S 
can be estimated by Monte Carlo integration (see Section 5.3 for details). 
Wei and Tanner (1990) refer to this approach as the Monte Carlo EM 
(MCEM) algorithm. In fact, the entire conditional distribution of S could 
be estimated in this way, providing input for computing the full posterior 
distribution in a "data augmentation" approach (Tanner and Wong, 1987). 
Meng and Rubin (1992) observe that their Partitioned ECM (PECM) al¬ 
gorithm, which partitions the parameter vector into k subcomponents in 
order to reduce the dimensionality of the maximizations, is essentially a 
deterministic version of the Gibbs sampler, an extremely general and easy- 
to-use tool for Bayesian computation. We discuss this and other popular 
Markov chain Monte Carlo methods in Section 5.4. 

3.5 Interval estimation 

Obtaining an empirical Bayes confidence interval (EBCI) is, in principle, 
very simple. Given an estimated posterior p(0,-|y,:, if), we could use it as 
we would any other posterior distribution to obtain an HPD or equal tail 
credible set for 0;. That is, if q a (yi, if) is such that 

P (0; < q a {Vi, if) I Oi ~ p(fii\Vi, »/)) = «, 

then a 100(1 — a)% equal tail naive EBCI for 0 8 - is 

(' Iall{yi,i )), gr_(a/ 2 )( 2 /i, iff) • 

Why is this interval "naive?" From introductory mathematical statistics 
we have that 

Var(0j|y) = E^y [Var(0<|s/j, rf)\ + Var^,, [E($i\yi, i))\ . (3.33) 

In the Gaussian/Gaussian case, a 95% naive EBCI would be 
E(0i | yi, if) ± 1.96\/Var(0j | y~, rj) . 

The term under the square root approximates E v |y [Var(0, | yi, r})], since 
the EB approach simply replaces integration with maximization. But this 
corresponds to only the first term in (3.33); the naive EBCI is ignoring the 
posterior uncertainty about rj (i.e., the second term m (3.33)). Hence, the 
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naive interval may be too short, and have lower than advertised coverage 
probability. 

To remedy this, we must first define the notion of "EB coverage". 
Definition 3.1 t a (y) is a (1- a)100% unconditional EB confidence set 
for gif) if and only if for each rj, 

p y,e\v(9{8) G <«( y)) « 1 - <* . 

So we are evaluating the performance of the EBC1 over the variability in¬ 
herent in both 9 and the data. But is this too weak a requirement? Many 
authors have suggested instead conditioning on some data summary, b(y). 
For example, taking b{ y) = y produces fully conditional (Bayesian) cover¬ 
age- Many likelihood theorists would take b( y) equal to an appropriate an¬ 
cillary statistic instead (see Hill, 1990). Or we might simply take b( y) = y ,, 
on the grounds that y; is sufficient for 9, when is known. 

Definition 3.2 t a { y) is a 100(1- a)% conditional EB confidence set for 
g(9) givenfe(y) if,for eachfo(y) = bandr/, 

Py,8\b(y)=b,T){g{9) G t a ( y)) % 1 - a . 

Naive EB intervals typically fail to attain their nominal coverage probabil¬ 
ity, in either the conditional or unconditional EB sense (or in the frequentist 
sense). One cannot usually show this analytically, but it is often easy to 
check via simulation. We devote the remainder of this section to outlining 
methods that have been proposed for "correcting" the naive EBCI. 


3.5.1 Morris' approach 


For the Gaussian/Gaussian model (3.9), Morris (1983a) suggests basing the 
EBCI on the modified estimated posterior that uses the "naive" mean, but 
inflates the variance to capture the second term in equation (3.33). This 
distribution is p'(0i\yi, fj) — N(9f B , V ’), where 

V* = a 2 (l - + -^3 p2 ( y i ~ F ) 2 • ( 3 - 34 ) 


The first term in (3.34) approximates the first term in (3.33), and is es¬ 
sentially V arifi |t/j, tj), the naive EB variance estimate. The second term 
in (3.34) approximates V ar^-y [E(9i\yi, 77 )], the second term in (3.33), and 
thus serves to "correct" the naive EB interval by widening it somewhat. 
Notice the amount of correction decreases to zero as k increases as the 
estimated shrinkage factor B decreases, or as the data-value approaches 
the estimated prior mean. 

Morris actually proposes estimating shrinkage by 


B 


fc-3 
k — 1 


, where f 2 = 


L - Y ,( Y i - Y ) 2 -« 2 


k- 1 
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which is the result of several "ingenious adhockeries" (see Lindley, 1983) 
designed to better approximate (3.33). It is important to remember that 
this entire derivation assumes <x 2 is known; if it is unknown, a data-based 
estimate may be substituted. Morris offers evidence (though not a formal 
proof) that his intervals do attain the desired nominal coverage. Extension 
of these ideas to non-Gaussian and higher dimensional settings is possible 
but awkward; see Morris (1988) and Christiansen and Morris (1997). 


3.5.2 Marginal posterior approach 

This approach is similar to Morris' in that we mimic a fully Bayesian calcu¬ 
lation, but a bit more generally. Since rj is unknown, we place a hyperprior 
tpirf) on r), and base inference about 0 t on the marginal posterior, 

lh(0i |y) = j p(0i\y, ri)h(r]\y)dy , 

where h(rj\y) a m(y\t])u(ij). 

Several simplifications to this procedure are available for many models. 
First, if rj = r)( Y) is sufficient for t] in the marginal family m(y|i/) and has 
density p{fj\T]), then we can replace h(t] |y) by h(rj\Tj) oc p(t]\r])'i)){r)), which 
conditions only on a univariate statistic. Second, note that 

p(0i |y, rj) oc p(0i , yi | yjjti, r]) 

= p(yi\0i,yj&, n)p(6i\yj&’V) 

= f{Vi I 6i, V) 9(6i I V) , ( 3 - 35 ) 

the product of the sampling model and the prior. Hence p(6i | y, rf) depends 
only on yi and y. and the marginal posterior can be written as 

lh(6i | y) = lh(0i I Vi, tf) ~ J P(Qi ! */«> V)H 0 I i fj)dr) ■ ( 3 - 36 ) 

Appropriate percentiles of the marginal posterior lh (instead of the esti¬ 
mated posterior) determine the EBCI. This is an intuitively reasonable 
approach, since l h accounts explicitly for the uncertainty in V- 3n other 
words, mixing p(6i | y, i}) with respect to h should produce wider intervals. 

As indicated by expression (3.35), the first term in the integral in equa¬ 
tion (3.36) will typically be known due to the conjugate structure of the 
hierarchy. However, two issues remain. First, will h be available? And sec¬ 
ond, even if so, can the integral be computed analytically? 

Deely and Lindley (1981) were the first to answer both of these questions 
affirmatively, obtaining closed-form results in the Poisson/gamma and "sig¬ 
nal detection" (i.e., where 0i is a 0-1 random variable) cases. They referred 
to the approach as "Bayes empirical Bayes," since placing a hyperprior on 
rj is essentially a fully Bayesian solution to the EB problem. However, the 
first general method for implementing the marginal posterior approach is 
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Figure 3.3 Diagram of the Laird and Louis Type III Parametric Bootstrap. 


due to Laird and Louis (1987), who used the bootstrap. Their idea was 
to take h(r] \ 77 ) — 77(77 | r/), the sampling density of 77 given 77 with the 
arguments interchanged. We then approximate l p by observing 

ip{k\yi,r)) = j p{Oi\yi,v)p{v\h)dv 
1 N 

« 12/i. ij) . (3-37) 

j =1 

where 77 * ~ 77 (■ [ ?/),./ = /_/V. Notice that this estimator converges to l p 

as N — ► 00 by the Law of Large Numbers. The i f- values are easily generated 
via what the authors refer to as the Type III Parametric Bootstrap: given 
77 , we draw 6* ~ g{0 | 77 ), and then draw y* I— f(y I 0 *), for i = i,-*- 
From the bootstrapped data sample y* = (r/j ,... ,yf) we may compute 

77 * = 77 (y*). Repeating this process N times, we obtain q* ~ 77(77 | 77 ), j = 
1, .... TV for use in equation (3.37), the quantiles of which, in turn, provide 
our corrected EBC1. 

Figure 3.3 provides a. graphical illustration of the parametric bootstrap 
process: it simply mimics the process generating the data, replacing the 
unknown 77 by its estimate 77 . The "parametric" name arises because the 
process is entirely generated from a single parameter estimate 77 , rather than 
by resampling from the data vector y itself (a "nonparametric bootstrap"). 
Hence we can easily draw from /?(• | 77 ) even when this function is not 
available analytically (e.g., when 77 itself must be computed numerically, as 
in many marginal MLE settings). 

Despite this computational convenience, the reader might well question 
the wisdom of taking h — p. An ostensibly more sensible approach would 
be to choose a reasonable 7 / 1 ( 77 ), compute h(t 7 | 77 ) oc 77(77 | T])tp(T]), and 
obtain an estimator of the corresponding l h . This marginal posterior would 
match a prespecified hyperprior Bayes solution, and, unlike J p . would not 
be sensitive to the precise choice of 77 . However, if EB coverage is the 
objective, then L may be as good as //,. After all, taking quantiles of 4 will 
lengthen the naive EBC1, but need not "correct" the interval to level 1- a, 
as we desire. Still, Carlin and Gelfand (1990) show how to use the Type 
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Ill parametric bootstrap to match any R, provided the sampling density 
p(-\rj) is available in closed form. 


3.5.3 Bias correction approach 

A problem with the marginal posterior approach is that it is difficult to say 
how to pick a good hyperprior ri’(q) (i.e., one that will result in R quantiles 
that actually achieve the nominal EB coverage rate). In fact, if q(y) is 
badly biased, the naive EBCI may not be too short, but too long. Thus, in 
general, what is needed is not a method that will widen the naive interval, 
but one that will correct it. 

Suppose we attempt to tackle the problem of EB coverage directly. Recall 
that q a {yi, q) is the a th quantile of j>(9 t \ y t) 77 ). Define 

r(ti.rj. yi,a) = P(6i < q a {yi,Tj) \ 6 t ~ p(0i\yi,v)) , 

and 

R(rj, 1 n,a) = Efj\ y „r, {r{r), ?/, yi,a)} . 

Hence R is the true EB coverage, conditional on 6(y) = y, . of the naive EB 
tail area. Usually the naive EBCI is too short, i.e., 

Rin «■ >a> “ Sma11 (say ° 25) 

\ < a, a large (say .975) 

If we solved R( rj. y l .a') = a for a', then using this ex' with rj would con¬ 
ditionally "correct the bias" in our naive procedure and give us intervals 
with the desired conditional EB coverage. But of course 77 is unknown, so 
instead we might solve 

R(r), yi, a') = a (3.38) 

for a' = a'(fj , y ,, a), and take the naive interval with a replaced by a' as 
our corrected confidence interval. We refer to this interval as a conditionally 
bias corrected EBCI. For unconditional EB correction, we can replace R by 

R*(V< a ) = Erj, y ,in {r(r/,rj, yi,a)} , 

and solve R*(rj, a 1 ) = a for ex'. The naive interval with a replaced by this 
a' is called the unconditionally bias corrected EBCI. 

Imp lemen tati on 

if p(f) | yi, q) is available in closed form (e.g., the Gaussian/Gaussian and 
exponential/inverse gamma models), then solving (3.38) requires only tra¬ 
ditional numerical integration and a rootfinding algorithm. If p( q | y i: 77 ) 
is not available (e.g., if 77 itself is not available analytically, as in the Pois¬ 
son/gamma and beta/binomial models), then solving (3.38) must be done 
via Monte Carlo methods. In particular, Carlin and Gelfand (1991a) show 
how the Type III parametric bootstrap may be used in this regard. Notice 
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that in either case, since g(0i \ rj) is typically chosen to be conjugate with 

fim I 0{), 

r{j), t], Vi , a) = Fy i t] [^"^(a)] , 

where F, h is the cdf corresponding to p(0 i | y ,, rj). So even when Monte 
Carlo methods must be employed, mathematical subroutines can be em¬ 
ployed at this innermost step. 

To evaluate whether bias correction is truly effective, we must check 
whether 

a* = Erjiy tt r, [ r (0- r). yi. a'{rj, j/, : , a))] = a . (3.39) 

If 17 —► T) as k —► 00 (i.e., f) is consistent for rj), then (3.39) would certainly 
hold for large k, but, in this event, the naive EB interval would do fine as 
well! For fixed/:, Carlin and Gelfand(1990) provide conditions (basically, 
stochastic ordering of p(9, \ y,. rj ) and p(tj \ y,. y ) in rj) under which 

a + min(/i, I2) < q“ < Q + max(/i, I2) , 


where 

[a'(fj,yi,a) - r(rj. rj. y { . a'{p, y t , a))}p(r) \ yi,a)dfj , 


[a'(fi,yi,a) - r(fj. rj .«/,. a'(r). y { , or))]p(i? | Vi,a)drj . 

x I 2 < 0, the true coverage of the bias corrected EB tail 
area, a*, lies in an interval containing the nominal coverage level a. 

Example 3.3 We now consider the exponential/inverse gamma model, 

where Y\,...,Yk ~ Expo(0;) and 9i,....9k ~ IG(?j, 1). The marginal dis¬ 
tribution for yi is then m(t/ 8 | rj) — rj/(yi + l) f,+1 , yi > 0, so that the MLE 
of rj is fj = k/Y)i=i lo S(y* + !)• 

Consider bias correcting the lower tail of the EBCI. Solving R*(r), a') — a 
for a 1 (unconditional bias correction), and plotting the values obtained as 
a function of rj, we see that a'(q, a) is close to a for 77 near 1, but decreases 
steadily as r) increases, and that this decrease is more pronounced for larger 
values of a. The behavior for the upper tail is similar, except that a'{r), a) 
is now an increasing function of ij (recall ais now near 0.95). 

To evaluate the ability of the various methods considered in this sec¬ 
tion to achieve nominal EB coverage in this example, we simulated their 
unconditional EB coverage probabilities and average interval lengths for 
two nominal coverage levels (90% and 95%) where we have set the true 
value of rj = 2 and k = 5. Our simulation used 3000 replications, and set 
N = 400 in those methods that required a parametric bootstrap. In ad¬ 
dition to the methods presented in this section, we included the classical 
(frequentist) interval and two intervals that arise from matching a specific 


and 


h 


h 


-l 


'v<v 

That is, since I\ 


© 2000 by CRC Press LLC 



Interval 

Method 

Average 

Lower 

Endpoint 

Average 

Upper 

Endpoint 

Average 

Interval 

Length 

Average 
Uncond’l 
Cov. Prob. 

O 

1! 

Classical 

.335 

19.5 

19.2 

.901 

Naive EB 

.355 

3.87 

3.51 

.839 

Bias correction 

.331 

4.74 

4.41 

.897 

Laird and Louis 

.339 

5.15 

4.81 

.904 

^i-matching 

.287 

3.23 

2.95 

.868 

t/' 2 -matching 

.311 

4.00 

3.69 

.894 

7 = .95 

Classical 

.268 

39.1 

38.8 

.952 

Naive EB 

.306 

5.53 

5.22 

.900 

Bias correction 

.285 

7.84 

7.55 

.952 

Laird and Louis 

.283 

7.79 

7.50 

.954 

0 i-mat.ching 

.246 

4.46 

4.51 

.930 

02 -matching 

.265 

5.93 

5.66 

.951 


Table 3.4 Comparison of simulated unconditional EB coverage probabilities, ex¬ 
ponential/inverse gamma model. 


hyperprior Bayes solution as in equation (3.36). The two hyperpriors we 
consider are both noninformative and improper, namely, 0 i(t?) — 1 > r l > 0 
(so that hi(ti | rj) = G(k + 1, fj/k)), and 0 2 (r?) = 1 /rj, T) > 0 (so that 
I n) = G(k, fi/k)). 

Looking at the results in Table 3.4, we see that the classical method 
faithfully produces intervals with the proper coverage level, but which are 
extremely long relative to all the EB methods. The naive EB intervals 
also perform as expected, having coverage probabilities significantly below 
the nominal level. The intervals based on bias correction and the Laird and 
Louis bootstrap approach both perform well in obtaining nominal coverage, 
with the former being somewhat shorter for 7 = -90 and the latter a bit 
shorter for 7 = .95. Finally, of the two hyperprior matching intervals, only 
the one based on 0 2 performs well, with the ipi intervals being too short. 
This highlights the difficulty in choosing hyperpriors that will have the 
desired result with respect to empirical Bayes coverage. 

On a related note, the proper choice of marginal hyperparameter esti¬ 
mate f] can a l so have an impact on coverage, and this impact is difficult to 
predict. Additional simulations (not shown) suggest that the best choice for 
f) is not the marginal MLE, but the marginal uniformly minimum variance 
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unbiased estimate (UMVUE), rj = (k — 1)/ Y11=i l°g(*A: + 1)- Ironically, the 
MLE-based intervals were, in general, a bit too long for the bias correction 
method, but too short for the Laird and Louis method. ■ 


3.6 Generalization to regression structures 

We have considered the basic two-stage empirical Bayes model with asso¬ 
ciated delta-theorem and bootstrap methods for obtaining approximately 
correct posterior means and variances. A brief look at the general model, 
where the prior mean for each coordinate can depend on a vector of re¬ 
gressors, shows features of this important case and further motivates the 
need for the more flexible and applicable hyperprior Bayesian approach to 
complicated models. 

Consider the two-stage model where 

9i ~ A T (X;a, t 2 ), (3.40) 

Yi\0i ~ N(9 t .a 2 ), (3.41) 

where X is a design matrix. Then, if the parameters (a, r, a) are known, 
the posterior mean and variance of are 

E = X,a + (1 — B)(Yi — X,a) (3.42) 

V = (1 — B)a 2 , (3.43) 

with B defined as in Example 2.1. Notice that the shrinkage applies to 
residuals around the regression line. 

For the EB approach, parameters are estimated by marginal maximum 
likelihood. As we have seen for the basic model, the posterior variance 
must be adjusted for estimation uncertainty. The Morris expansion (Morris, 
1983) gives 

Vi = (l - B)a 2 + ^Ba 2 + (3.44) 

+ k _l_ 2 B 2 (Yi-X ia ) 2 , (3.45) 

where all hatted values are estimated and c is the dimension of a. Formula 
(3.44) generalizes (3.34) and is similar to the adjustment produced by using 
restricted maximum likelihood (REML) estimates of variance instead of ML 
estimates (as in SAS Proc mixed or BMDP-5V; see Appendix C). 

Derivation of (3.44) is non-trivial, and doesn't account for correlation 
among parameter estimates or the fact that the posterior is skewed and 
not Gaussian; see Laird and Louis (1987), Carlin and Gelfand (1990), and 
Section 3.5 of this book. As we shall see in Chapter 5, these derivations 
can be replaced by Monte Carlo methods that account for all of the conse¬ 
quences of not knowing prior parameters. The power of these methods for 
the linear model are further amplified for nonlinear models. For example, in 
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Example 3.1, to use a Gaussian/Gaussian model we transformed five-year 
death rates to log odds ratios. A more direct approach based on logistic 
regression requires the methods in Chapter 5. 


3.7 Exercises 

1. (Berger, 1985, p.298) Suppose V, N(9i, 1). /' = 1. k, and that the 

0, are i.i.d. from a common prior G. Define the marginal density m(t/j) 
in the usual way as 

m(yi) - j N(yi \ 0,-. 1) dG(9,) . 

(a) Show that, given G, the posterior mean for can be written as 


E(6i\y,) = y, + 


™'(yi) 

m{y,) 


where m! is the derivative of m. 

(b) Suggest a related nonparametric EB estimator for 0,. 


2. Under the compound sampling model (3.4) with fi=i for all i, show 
that (3.5) holds (i.e.. that the V,s are marginally i.i.d.). What is the com¬ 
putational significance of this result for parametric EB data analysis? 

3. Consider the gamma/inverse gamma model, i.e., Y\,..., Y\ 1 ~ i G(a , 9j), 
a a known tuning constant, and 9i,..., 9^ ~ IG(r), 1). 

(a) Find the marginal density of ?/;, m(yi\rj). 

(b) Suppose cv = 2. Find the marginal MLE of rj, q. 

4. In the Gaussian/Gaussian model (3.9), if a 2 = 1 and p = 0, the Y{S are 
marginally independent with distribution 

Y ( \t ~ (V(0,l + r 2 ) = JV(0,1/5) . 

(a) Find the marginal MLE of B, B M le, and resulting PEB point 
estimates 9f B , i = 1,..,, k. 

(b) Show that while B in (3.25) is not equal to Bmle , it is unbiased for B 
with respect to m(y\B). Show further that using this B in the usual 
PEB point estimation fashion produces the James-Stein estimator 
(3.23). 

5. Prove result (3.24) using the completeness of the non-central chi-square 
distribution. 

6. Show how to evaluate (3.24) for a general 6. {Hint: A non-central chi- 
square can be represented as a Poisson mixture of central chi-squares 
with mixing on the degrees of freedom.) 
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7. Prove result (3.26) on the maximum coordinate-specific loss for the 
James-Stein estimate. 

8 . Consider again Fisher's sleep data: 

1.2,2.4,1.3,1.3,0.0,1.0,1.8,0.8,4 6 .1.4 . 


Suppose these k = 10 observations arose from the Gaussian/Gaussian 
PEB model, 

Yitfi*- N(0i,<r 2 ), i=l,...,k, 

0i N(n, r 2 ), i = l,...,k. 

Assuming cr 2 = 1 for these data, compute 


(a) 

(b) 

(c) 

(d) 


A naive 95% EBC1 for 0 9 . 
A Morris 95% EBCI for ff s . 


Compare the two point and interval estimates you obtain. 

9. Consider the PEB model 

Yi\0i Poisson(6,U), 9 { ~ G(a, b), i = 1,..., k. 

(a) Find the marginal distribution of Y = (Yi, ..., Yt) T . 

(b) Use the method of moments to obtain closed form expressions for the 
hyperparameter estimates a and b. (Hint: Define the rates n = Y t /t r , 
and equate their first two moments, r and s 2 , to the corresponding 
moments in the marginal family.) 

(c) Let/: — 5, a = b =3, and t, = 1 for all i, and perform a simulation 
study to determine the actual unconditional EB coverage of the 90% 
equal-tail naive EBCI for 6\. 

10. Consider the data in Table 3.5. These are the numbers of pump failures, 
Yi, observed in t, thousands of hours for k = 10 different systems of 
a certain nuclear power plant. The observations are listed in increasing 
order of raw failure rate r,-, the classical point estimate of the true failure 
rate 6i for the i th system. 


(a) Using the statistical model and results of the previous question, com¬ 
pute PEB point estimates and 90% equal-tail naive interval estimates 
for the true failure rates of systems 1, 5, 6 , and 10. 

(b) Using the approach of Laird and Louis (1987) and their Type III 
parametric bootstrap, obtain corrected 90% EBCIs for 0 5 and 0 6 . 
Why does the 9 5 interval require more drastic correction? 

11. In the Gaussian/Gaussian EM example. 
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i 

Yi 

ti 

Vi 

i 

5 

94.320 

.053 

2 

1 

15.720 

.064 

3 

5 

62.880 

.080 

4 

14 

125.760 

.111 

5 

3 

5.240 

.573 

6 

19 

31.440 

.604 

7 

1 

1.048 

.954 

8 

1 

1.048 

.954 

9 

4 

2.096 

1.910 

10 

22 

10.480 

2.099 


Table 3.5 Pump Failure Data (Gayer and O'Muircheartaigh. 1987) 


(a) Complete the derivation of the method for finding MMLEs of the 
prior mean and variance. 

(b) Use the Meilijson approach to find the observed information. 

(c) For this model, a direct (but still iterative) approach to finding the 
MMLE is straightforward. Find the gradient for the marginal distri¬ 
bution and propose an iterative algorithm for finding the MMLEs. 

(d) Find the observed information based on the marginal distribution. 

(e) Compare the EM and direct approaches. 

12. Do a simulation comparison of the MLE and three EB approaches: Bob¬ 
bins, the Poisson/gamma, and the Gaussian/Gaussian model for esti¬ 
mating the rate parameter in a Poisson distribution. Note that the Gaus¬ 
sian/Gaussian model is incorrect. Investigate two values of k (10, 20) and 
several true prior distributions. Study gamma distributions with large 
and small means and large and small coefficients of variation. Study two 
distributions that are mixtures of these gammas. Compare approaches 
relative to SSEL and the maximum coordinate-specific loss and discuss 
results. You need to design the simulations. 

Extra credit: Include in your evaluation a Maritz (forced monotonicity) 
improvement of the Robbins rule, and the rule based on the NPML (see 
Section 7.1 and Subsection 3.4.3 above). 
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CHAPTER 4 


Performance of Bayes procedures 


In the preceding two chapters, we developed the Bayes and EB approaches 
to data analysis, and evaluated their effectiveness on their own terms (e.g., 
by checking the Bayes or EB coverage of the resulting confidence sets). 
In this chapter, we present our new technologies with a more formidable 
challenge: can they also emerge as effective in traditional, frequentist terms? 
One might expect the news to be bad here, but, for the most part, we will 
show that when carefully applied (usually with low-information priors), 
the Bayes and EB formalisms remain effective using the more traditional 
criteria. 

Throughout this chapter, we employ the criteria outlined in Appendix B, 
our review of decision theory. There, using squared error loss (SEL) and 
its frequentist expectation, mean squared error (MSE), we develop basic 
examples of how shrinkage and other features of Bayes and EB procedures 
can offer improved performance. In this chapter, we elaborate on this idea 
and extend it beyond MSE performance to many other aspects of modern 
statistical modeling, such as the structuring of complex models and orga¬ 
nizing of complex computations, appropriate tracking and incorporation of 
relevant uncertainties into a final analysis, convenient structuring of sensi¬ 
tivity analyses, and documentation of assumptions and models. Note that 
none of these features are unique to. nor place any requirements on, one's 
philosophical outlook. 

We start with two basic, but possibly nonintuitive, examples of how 
Bayes rules process data. These examples serve to alert the analyst to 
the complex nature of multilevel inference, and drive home the point that 
once a model is specified, it is important to let the probability calculus 
take over and produce inferences. We then proceed to a frequentist and 
empirical Bayes evaluation of the Bayesian approach to point and interval 
estimation. We close with a brief discussion of experimental design, perhaps 
the one area where all statisticians agree on the importance of prior input. 
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4.1 Bayesian processing 

Most of the common examples of Bayesian analysis are like Example 2.1 
in that the posterior mean exhibits shrinkage from the data value toward 
the prior mean. However, univariate models with nonconjugate priors and 
multivariate models with conjugate priors can possibly exhibit nonintuitive 
behavior. In this section, we give two examples based on the Gaussian 
sampling distribution. Schmittlein (1989) provides additional examples. 

4.1.1 Univariate stretching with a two-point prior 

Consider a Gaussian sampling distribution with unit variance and a two- 
point prior with mass at ±£. Based on one observation y. the posterior 
distribution is 

with one minus this probability on — 8. The posterior expectation of 0 
thus 

e 26y - 1 

' l + e 26 y 

if 6 > 1 , for an interval of y-values around 0, the posterior mean is farther 
from 0 than is y, a situation we term stretching. Figure 4.1 illustrates the 
phenomenon when S = 1.2 for positive y values (the plot for negative y is 
a mirror image of this one). Notice that E(9\y) stretches (exceeds y) for 
y less than about 1.0. For larger y , the posterior mean does exhibit the 
usual shrinkage back toward zero, and in fact approaches an asymptote of 
E(6\y) = 6 as y —► oo. 

Stretching does violate intuition about what should happen with a uni- 
modal sampling distribution. The explanation lies in use of the posterior 
mean (resulting from squared-error loss) in a situation where the prior is 
two-point. In many applications, the posterior probability itself is a suffi¬ 
cient summary and squared error loss is inappropriate. 


4.1.2 Multivariate Gaussian model 

We now consider another example where intuition may not provide the 
correct answer, and the Bayesian formalism is needed to sort things out. 
Consider again the model of Example 2.7, wherein the coordinate-specific 
parameters and the data are both / 7 -variate Gaussian. That is, 

Y | 0 ~ N p (e.S) and 6 | A ~ Ap(0. A) . 

Writing the posterior mean and variance in their "shrinkage" form, we have 

E(0 | y) = (I-B)y 

Var(6 | y) = (I-B)A (4.1) 
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Figure 4.1 Example of stretching: Gaussian likelihood with two-point prior. 


where B = £(£ + A) -1 . If both £ and A are diagonal, there is no 
linkage among coordinates, and coordinate-specific posteriors can be com¬ 
puted without reference to other coordinates. In general, however, the full 
multivariate structure must be maintained. Preserving the full structure 
increases efficiency (DeSouza, 1991), and can produce the nonintuitive re¬ 
sult that coordinate-specific estimates can be stretched away from the prior 
mean (as in the previous example), or shrunk toward it but beyond it, a sit¬ 
uation we term crossing. There is always shrinking toward the prior mean 
for the full multivariate structure, however, in that the posterior mean in 
(4.1) has a Euclidean norm smaller than does the data y. 


A bivariate example 


Consider the bivariate case (p = 2) with £ = I and the diagonal elements 
of A equal (i.e., the two coordinates of 9 have equal prior variance). Then 
I - B has the form 


I — B 



(4.2) 


where 0 < a < 1 and a > 6|. The posterior mean shows bivariate shrinkage 
of the observed data toward the prior mean. The point determined by 
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the posterior mean is inside the circle with center at the prior mean, and 
radius the distance of the observed data from the prior mean (since I - B 
has eigenvalues at ± 6 ). However, the posterior mean of the first coordinate 
is 

h = aY 1 + bY 2 . 

Thus, stretching for this coordinate will occur if Y 2 > . Crossing will 

occur if Y-, < — 2-Yi. 

Though we have presented a two-stage Bayesian example, the same prop¬ 
erties hold for empirical Bayes (see for example Waternaux et al.. 1989) and 
multilevel hierarchical Bayesian analyses. The general lesson is to let the 
Bayesian formalism operate on the full structure of the model and sort 
out the appropriate posterior relations, since intuitions developed in basic 
models may be incorrect in more complicated settings. 

4.2 Frequentist performance: Point estimates 

Besides having good Bayesian properties, estimators derived using Bayesian 
methods can have excellent frequentist and EB properties, and produce 
improvements over estimates generated by frequentist or likelihood-based 
approaches. We start with two basic examples and then a generalization, 
still in a basic framework. These show that the Bayesian approach can be 
very effective in producing attractive frequentist properties, and suggest 
that this performance advantage holds for more complex models. 

4.2.1 Gaussian/Gaussian model 

The familiar Gaussian/Gaussian model provides a simple illustration of 
how the Bayesian formalism can produce procedures with good frequen¬ 
tist properties, in many situations better than those based on maximum 
likelihood or unbiasedness theory. We consider the basic model wherein 
Y\e ~ N(0. 1) and 0 ~ ;V(0. r 2 ) (that is, the setting of Example 2.1 where 
a 2 = 1 and ^ = 0). We compare the two decision rules 6\ (v) = y and 
S 2 (ii) = (1- B)y, where B= 1/(1+r 2 )- Here, (q is the MLE and UMVUE 
of 9. while 82 is the posterior mode, median, and mean (the latter implying 
it is the Bayes rule under squared error loss). We then have 

MSEiS,) = R(e,6 1 ) = E Y]e (e-6 1 (Y)) 2 = E Y \e(9-Y) 2 
= Var Y \g(Y) = 1 , 

and also 

MSE(6 2 ) = R(0,6 2 ) = E Y{e (6 - 6 2 (Y)f 
= E Y \ e [6 - (1 - B)Y] 2 
= E Y \ e [(l-B)(0-Y) + B0] 2 
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Figure 4.2 MSE (risk under squared error loss) for three estimators in the Gaus¬ 
sian/Gaussian model. 


= ( \-BfE ne (0-Y ) 2 

+2(1-B)B0E Y \e(Q-Y) + B 2 6 2 
= (1 -Bf + B 2 6 2 , 

which is exactly the variance of $ 2 plus the square of its bias. It is easy to 
show that MSE(S 2 ) < MSE (6 1 ) if and only if |0| < ^/(2 - B)/B. That 
is, the Bayes rule has smaller frequentist risk provided the true mean, 9, is 
close to the prior mean, 0 . 

This situation is illustrated for the case where B = 0.5 in Figure 4.2, 
where the dummy argument t is used in place of 9. MSE( 6 \) is shown 
by the solid horizontal line at 1, while MSE( 6 2 ) is the dotted parabola 
centered at 0. For comparison a dashed line corresponding to the MSE for 
a third rule, 63 (y) = 2, is also given. This rather silly rule always estimates 
9 to be 2, ignoring the data completely. Clearly its risk is given by 

MSE( 8 3 ) = R(6, S 3 ) = E Y[e (9 - 2) 2 = (9 - 2) 2 , 

which is 0 if 9 happens to actually be 2 , but increases much more steeply 
than MSE( 82 ) as 9 moves away from 2. This rule is admissible, since 
no other rule could possibly have lower MSE for all 9 (thanks to the 0 
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Figure 4.3 MSEfor three estimators in the beta/binomial model with n = 1 and 
H =0.5. 


value at 5 = 2), but the large penalty paid for other 0 values make it 
unattractive for general use. The Bayes rule 62 can a l so be shown to be 
admissible, but notice that it would be inadmissible if B < 0, since then 
MSE( 82 ) > 1 = MSE(b 1 ) for all 0. Fortunately, B cannot be negative 
since r 2 cannot be; the point here is only to show that crossing (shrinking 
past the prior mean) is a poor idea in this example. 

For the more general setting of Example 2.2, the Bayes rule will have 
smaller MSE than the MLE when 


0 € 


a 2 +2nr 2 


For the special case where r 2 = a 2 (i.e., the prior is "worth" one observa¬ 
tion), the limits simplify to /i ± aJ2 + a broad region of superiority. 


4.2.2 Beta/binomial model 

As a second example, consider again the estimation of the event probability 
in a binomial distribution. For the Bayesian analysis we use the conjugate 
Betafa, b) prior distribution, and follow Subsection 3.3.2 in reparametrizing 
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0 


Figure 4.4 MSEfor three estimators in the beta/binomial model with n = 20 and 

fi = 0.5. 


from (a, b) to (//, M) where pi = alia +/;), the prior mean, and M = a + b, 
a measure of prior precision (i.e., increasing M i mplies decreasing prior 
variance). Based on squared-error loss, the Bayes estimate is the posterior 
mean, namely, 

§n,M = Mfi + X _ M n X 

M + n M + n M + n n 

Hence 9 ,i,Af is a weighted average of the prior mean and the maximum 
likelihood estimate, X/n. Irrespective of the value of p, the MLE of 6 is 
0^’°, The MSE of the Bayes estimate, Ex\e(0^' M - 6 ) 2 , is then given by 


M + n 


V(L-O) 

n 


+ 




(4.4) 


This equation shows the usual variance plus squared bias decomposition of 
mean squared error. 

Figure 4.3 shows the risk curve for n = 1, pi = 0.5, and M = 0, 1, 2. 
If one uses the MLE with n = 1, the MLE must be either 0 or 1; no 
experienced data analyst would use such an estimator. Not surprisingly 
the MSE is 0 for $ = 0 or 1, but it rises to .25 at 9 = .5. The Bayes rule 
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e 

Figure 4.5 MSEfor three estimators in the beta/binomial model with n = 20 and 

H = 0 . 1 . 


with M = 1 (dotted line) has lower MSE than the MLE for 0 in the interval 
.5 ± ^ = (.067-933). When M = 2 (dashed line), the region where the 
Bayes rule improves on the MLE shrinks toward 0.5, but the amount of 
improvement is greater. This suggests that adding a little bias to a rule in 
order to reduce variance can pay dividends. 

Next, look at Figure 4.4, where ^ is again 0.5 but now n = 20 and M = 

0, 1, 5. Due to the increased sample size, all three estimators have smaller 
MSE than for n = 1, and the MLE performs quite well. The Bayes rule 
with M = 1 produces modest benefits for 6 near 0.5, with little penalty for 
9 near 0 or 1, but it takes a larger M (i.e., more weight on the prior mean) 
for the Bayes rule to be very different from the MLE. This is demonstrated 
by the curve for M = 5, which shows a benefit near 0.5, purchased by a 
substantial penalty for 0 near 0 or 1. The data analyst (and others who 
need to be convinced by the analysis) would need to be quite confident that 
9 is near 0.5 for this estimator to be attractive. 

Using the Bayes rule with "fair" prior mean /i = 0.5 and small precision 
M = 1 pays big dividends when n = 1 and essentially reproduces the MLE 
for n = 20. Most would agree that the MLE needs an adjustment if n = I, a 
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smaller adjustment if n = 2, and so on. Bayes estimates with diffuse priors 
produce big benefits for small n (where variance reduction is important). 

Finally, Figure 4.5 shows the costs and benefits of using a Bayes rule with 
an asymmetric prior having fi =0.1 when n = 20. As in the symmetric 
prior case //. = 0.5), setting 141 = 1 essentially reproduces the MLE. With 
M = 5, modest additional benefits accrue for 0 below about 0.44 (and a 
little above 0), but performance is disastrous for 0 greater than 0.6. Such 
an estimator might be attractive in some application settings, but would 
require near certainty that 8 < 0.5. We remark that the Bayes preposterior 
risk is the integral of these curves with respect to the prior distribution. 
These integrals produce the pre-posterior performance of various estimates, 
a subject to which we return in Section 4.4. 

In summary, Bayesian point estimators based on informative priors can 
be risky from a frequency standpoint, showing that there is no statisti¬ 
cal "free lunch." However, even for univariate analyses without compound 
sampling, the Bayesian formalism with weak prior information produces 
benefits for the frequentist. 

A note on robustness 

In both the binomial and Gaussian examples, either an empirical Bayes 
or hierarchical Bayes approach produces a degree of robustness to prior 
misspecification, at least within the parametric family, by tuning the hy¬ 
perparameters to the data. In addition, for exponential sampling distribu¬ 
tions with quadratic variance functions. Morris (1983b) has shown broad 
robustness outside of the conjugate prior family. 


4.2.3 Generalization 

The foregoing examples address two exponential family models with con¬ 
jugate prior distributions. Samamego and Reneau (1994) evaluate these 
situations more generally for linear Bayes estimators (those where the es¬ 
timate is a convex combination of the MLE and the prior mean, even if G 
is not conjugate) based on (iid) data when the MLE is also unbiased (and 
therefore MVUE). With 0 < a < 1 and 6 the MLE, these estimates have 
the form 

8g = ct8 + (1 — a)E G (9). (4.5) 

If the true prior is Go, this Bayes estimate has smaller pre-posterior MSE 
than the MLE so long as 

V Go {d) + [E a (9) - E Go (8 )] 2 < l -^MSE(G 0 , 8). (4.6) 

1 — a 

Since Go can be a point mass, this relation provides a standard, frequentist 
evaluation. 
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Inequality (4.6) can be explored quite generally to see when the Bayesian 
approach is superior. For example, if Go is degenerate at 0 O then the first 
term is 0 and the comparison depends on the relation between 0 O) the 
assumed prior mean {Eq(0)) and the weight put on the prior mean (1- a 
This special case structures the foregoing evaluations of the beta/binomial 
and Gaussian/Gaussian models. 

Samaniego and Reneau further investigate the situation when the true 
and assumed prior means are equal. Then inequality (4.6) becomes 

V Go (e)<p^MSE{G 0 ,e). ( 4 . 7 ) 

They summarize the results of their investigation by concluding that at 
least for the models considered, the Bayesian approach is quite robust in 
that it is preferred to the frequentist so long as either the assumed prior 
is approximately correct or if relatively little weight is put on the prior. 
The Bayesian will get in trouble if she or he is overly confident about an 
incorrect assumption. 


4.3 Frequentist performance: Confidence intervals 

In addition to producing point estimates with very good frequentist per¬ 
formance. the Bayesian approach can also be used to develop interval esti¬ 
mation procedures with excellent frequentist properties in both basic and 
complicated modeling scenarios. As in the previous subsection, our strategy 
will be to use noninformative or minimally informative priors to produce 
the posterior distribution, which will then be used to produce intervals via 
either HPD or simpler "equal-tail" methods (see Subsection 2.3.2). 

In complicated examples, the Bayesian formalism may be the only way to 
develop confidence intervals that reflect all uncertainties. We leave intervals 
arising from the Gaussian/Gaussian model as an exercise, and begin instead 
with the beta/binomial model. 


4.3.1 Beta/Binomial model 
Properties of the HPD interval 

Again using the notation of Subsection 4.2.2, we recall that the posterior in 
this case is Beta(a+x, b+n—x). Before considering frequentist performance, 
we examine how data are processed by the 100(1- 2a)% Bayesian credible 
interval. Suppose we adopt the uniform Betci(a = l, b = 1) prior, and 
consider the case where X = 0. We have the posterior distribution 

p(0\x = 0) = (» + 1)(1 - Of , (4.8) 
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a monotone decreasing function. Thus the HPD Bayesian credible interval 
is one-sided. From (4.8) it is easy to show that 

S(t\X = 0) = P(9 > t\X = 0) = (1 - t) n+l , 

so letting d = 2a, the upper endpoint of the FIPD interval may be found 
by solving the equation S(t) _ d for t. Doing this, we obtain the HPD 
interval as 

(o . 1 - d »tt' j . 

More generally, it is easy to show that when X = 0, the Bayesian interval 
is one-sided so long as a — M p < 1. A similar result holds for the case 
X = n, and is left as an exercise. 

Unlike the case of the Gaussian/Gaussian model, however, if one uses 
the zero-precision prior (M = 0). technical problems occur if X equals 
either 0 or n. Hence the prior must contain at least a small amount of 
information. Fortunately, we have seen that using a small value of M > 0 
actually leads to point estimates with better frequentist performance than 
the usual maximum likelihood estimate. W e investigate the properties of 
the corresponding interval estimates below. 

As an interesting sidelight, consider representing the upper limit in (4.9) 
as a pseudo number of events one might expect in a future experiment (X*) 
divided by the sample size (n). Then. 

X* = (n + 1)[1 — d "+ 1 ] —- — log( d) as n . —»■ oo . 

Thus as the sample size gets large, the numerator of the upper limit of 
the confidence interval converges to a fixed number of pseudo-events. For 
example, with d = .05 (i.e., a = .025). this number of pseudo-events is 
- log(.05) ~ 3. As reported by Louis (1981a) and Mann et al. (1988), this 
value is the upper limit for most individuals when asked: "In the light of 
zero events in the current experiment, what is the largest number of events 
you think is tenable in a future replication of the experiment?" 

Frequentist performance 

Based on the Beta(a + x, b + n — x) posterior distribution for 0, a simple 
100(1 - 2 a)% Bayesian credible interval for 9 is the "equal-tail" interval 
obtained by taking the upper and lower a-points of the posterior distribu¬ 
tion. But as we have just seen, this approach would be silly for a < 1 when 
X = 0. This is because the monotone decreasing nature of the posterior in 
this case implies that the Q values excluded on the lower end would have 
higher posterior density than any 0 values within the interval! A similar 
comment applies to the situation where b = (1 - p ) M < 1 and X = n. 

A possible remedy would be to define the "corrected equal-tail" interval 
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as follows: 


{ ( 0 , 9 i_ 5 .) , a + x < 1 

(95,1), b + n - x < 1 . (4.10) 

(9„,9tf„) , otherwise 

where q a denotes the a th quantile of the Beta(a + x. b + n — x) posterior, 
and again a = 2a. Though this interval is still straightforward to compute 
and removes the most glaring deficiency of the equal-tail procedure in this 
case, it will still be overly broad when the posterior is two-tailed but skewed 
to the left or right. 

Of course, the HPD interval provides a remedy to this problem, at some 
computational expense. Since the one-tailed intervals in equation (4.10) are 
already HPD, we focus on the two-tailed case. The first step is to find the 
two roots 6 l and 9 V of the equation n(9\x) = k(a ) for a given t(a). This is 
not too difficult numerically, since ?r in this case is simply a (a+b+n \-degree 
polynomial in 9. We next find P(0 L < 9 < 9 v \x) = F(%|j)- F(9 L \x), 
where F is the cdf of our beta posterior. Finally, we adjust k(a) up or 
down depending on whether the posterior coverage of our interval was too 
large or too small, respectively. Iterating this procedure produces the HPD 
interval. 

Notice that the above algorithm amounts to finding the roots of the 
equation 

P(0 L <9 < 0[r|jr) = 1 - 2 q , 

wherein each step requires a rootfinder and a beta cdf routine. The latter 
is now readily available in many statistical software packages (such as S), 
but the former may require us to do some coding on our own. Fortunately, 
a simple bisection (or interval halving) technique will typically be easy to 
program and provide adequate performance. For a description of this tech¬ 
nique, the reader is referred to a calculus or elementary numerical methods 
text, such as Burden et al. (1981, Section 2.1). Laird and Louis (1982) give 
an illustration of the method's use in an HPD interval setting. 

We now examine the frequentist coverage of our Bayesian credible inter¬ 
vals. Adopting the notation C(x) = (6 L (x ), 9 u (x)), this is defined as 

coverage(0) = P x \e[9 SE C{X)\ = E xle [I {e€C(X)} \ ; 

note carefully that 9 is held fixed in this calculation. To allow for the 
possible effect of posterior skewness, we first compute the lower and upper 
non-coverages 

lower(0) = E x \ e [I {e< e L ( X ) } ] (4.11) 

upper(0) = E X \ e [I{g >eu ( X )>] , 

so that coverage) $ = 1 - lower! 0 - upper(0). These coverages are easily 
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true theta value 


Figure 4.6 Coverage of "corrected" equal tail 95% credible interval under three 
symmetric priors with n = 10. 


obtained by summing binomial probabilities. For example 

lower(0) = ( ” 'j 6 X (1 - e) n - x I {s<H{x)} . (4.12) 

x=0 ' ' 

We begin by reporting coverages of the corrected equal tail intervals 
defined in equation (4.10) based on three beta priors that are symmetric 
about 0 = 0.5. These are the first three listed in Table 4.1, namely, the 
Jeffreys prior, the uniform prior, and an informative Beta( 5, 5) prior. The 
table also lists the prior mean p. the prior precision M, and the percentage 
of the weight given to the prior by the posterior when n =10, namely, 
wio = lOOjgqffo • Since the Beta( 5, 5) prior gives equal posterior weight to 
the data and the prior, this is clearly a strongly informative choice. 

For n =10, Figure 4.6 plots the coverages of the Bayesian intervals 
with nominal coverage probability (1 - 2a) = .95 across a arid of possible 
"true" 9 values from 0.05 to 0.95. Since the symmetry of the priors implies 
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Figure 4.7 Coverage of "corrected" equal tail 95% credible interval under three 
symmetric priors with n = 40. 


that lower (9) = upper(l - Q\ the latter quantity is not shown. To enhance 
interpretation of the plot, horizontal reference lines are included at the 
0.95 and 0.025 levels. We see that the two noninformative priors produce 
values for coverage (Q) that are reasonably stable at the nominal level for 


a 

b 


M 

«>10 

comments 

0.5 

0.5 

.50 

1.0 

9 

Jeffreys (U-shaped) 

1.0 

1.0 

.50 

2.0 

17 

uniform 

5.0 

5.0 

.50 

10.0 

50 

symmetric, strongly informative 

1.0 

3.0 

.25 

4.0 

29 

asymmetric, weakly informative 


Table 4.1 B eta prior distributions 
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all 9, though the discrete nature of the binomial distribution causes some 
irregularities in the pattern. However, the informative prior performs poorly 
for extreme 9 values, and only slightly better for 9 values near 0.5 (though of 
course, the narrowness of its intervals should be borne in mind as well). Of 
particular interest is the 0% coverage provided if 6 = 0.05 or 0.95, a result 
of the fact that the informative prior will not allow the intervals to stray 
this far from 0.5 - even if the data encourage such a move (i.e., X = 0 or 
n). As in the case of point estimation, we see that informative priors enable 
more precise estimation, but carry the risk of disastrous performance when 
their informative content is in error. 

Figure 4.7 repeats the calculations of Figure 4.6 in the case where n = 40. 
Again both noninformative priors produce intervals with good frequentist 
coverage, with the uniform prior now performing noticeably better for the 
most extreme 9 values. The informative prior also performs better, though 
still poorly for \9— 0.51 > 0.3. This is due to the fact that, with the increased 
sample size, the prior now contributes only 20 % of the posterior information 
content, instead of 50%. 

Finally, Figure 4.8 reconsiders the n =10 case using an asymmetric 
Beta(l, 3) prior, listed last in Table 4.1. This prior is mildly informative 
in this case (contributing 4/14 = 29% of the posterior information) and 
favors smaller 9 values over larger ones. As a result, upper (9) is no longer 
a mirror image of lower( 0 ). and this is evident in the unacceptably large 
values of the former shown for large 9 values. Again, the message is clear: 
informative priors must be used with care. 

4.3.2 Fieller-Creasy problem 

We now take up the problem of estimating the ratio of two normal means, a 
problem originally considered by Fieller (1954) and Creasy (1954). Suppose 
Xi ~ N(9i,cr 2 ) and X 2 ~ N{9 2 ,<t 2 ), assumed independent given 9 1 , # 2 > 
and a 2 . The parameter of interest is the mean ratioi 7 = 61 /0 2 . A plausible 
frequentist approach might begin by observing that jX 2 ~ N(6\ , cr 2 y 2 ), so 
that X\ — yX 2 ~ N{ 0,cr 2 (l + J 2 ))- Hence, 



and so endpoints for a 100 ( 1 - a) % confidence interval for 7 could be derived 
by solving the quadratic equation 


(X 1 -7A' 2 ) 2 = 4 /2 <x 2 (1 + 7 2 ) 


(4.13) 


for 7 . We notate the roots of this equation by 

While appearing reasonable, this approach suffers from a serious defect: 
the quadratic equation (4.13) may have no real roots. That is, the discrim- 
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Figure 4.8 Coverage of "corrected" equal tail 95% credible interval under an 
asymmetric Beta(a = 1, b = 3) prior with n = 10. 


inant of the quadratic equation, 

D{X 1 , X 2 ) = 4zl, 2 <r\Xl + X 2 - 4 /2 <t 2 ) , 

may be negative. To remedy this, we might try the following more naive 
approach. Write 


= 

X 2 0 2 + X 2 - 0 2 
X 1 

02 (l + 

Xi [\, x 2 -e 2 ] 

w ^l 1 - oT~\ 


(4.14) 


the last line arising from a one-term Taylor expansion of the denominator 
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Thus an alternate (1- a) 100% frequentist confidence interval for y is given 
by 



While this procedure will never fail to produce an answer, its accuracy is 
certainly open to question. 

By contrast, the Bayesian procedure is quite straightforward. Stephens 
and Smith (1992) show how to obtain samples from the joint posterior dis¬ 
tribution of 0 i, 02 , and a 2 under the noninformative prior k( 0 1 ,0 2 ,cr) = 1/cr 
using a Monte Carlo method called the weighted bootstrap (see Subsec¬ 
tion 5.3.2). We stay within the simpler <x 2 known setting, and adopt the 
conjugate priors 6 1 ~ N(pi, rf) and 0 2 ~ N(p 2 , r|) (recall that these may 
be made noninformative by setting T l = t 2 = °°). From Example 2.1 we 
then obtain the independent posterior distributions 


p(0i\xi) = N 


<T 2 pi+T 2 Xi (T 2 T 2 \ . 1 o 

* 2 + T 2 ’ ( t 2 + t 2 )’ 
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We could now derive the distribution of a? 2 analytically, perhaps using 

the familiar Jacobian approach, but, like Stephens and Smith (1992), we 
find it far easier to use a sampling-based approach. Suppose we draw 

e\ 9) ~ p ( 0 <|*<), g=l,...,G, * = 1 , 2 , 

and then define y la > = o[ g) /d^ g = 1, . . ., G. These y^ 9) values then con¬ 
stitute a sample of size G from the posterior distribution of y. Hence a 
(1 - a) 100 % equal-tail credible interval for y could be obtained by sorting 
the { 7 W} sample, and subsequently taking y L as the (Ga/2)* ft -smallest 
element in the sample, and yu as the (G'o/2) i, 'dargest element in the sam¬ 
ple. (Note that (G'a/2) m ay first have to be rounded to the nearest integer, 
or else this calculation may require interpolation.) 

To compare the frequentist performance of these three approaches to 
the Fieller-Creasy problem, we perform a simulation study. After choosing 
some "true" values for 9 \, 62, and cr 2 , we generate 

Xij ~N(9 it , t 2 ), j = i = 1,2, 

and apply each of the three procedures to each of the generated data pairs 
(A'ij, AGj). (Note that to implement our sampling-based Bayesian proce¬ 
dure, this means we must generate 6 ^) , g = 1 ,...,<-?} for each 

simulated data pair. Thus we effectively have a simulation within a simula¬ 
tion, albeit one that is very easily programmed.) For each simulated data 
pair, we record the lengths of the three resulting confidence intervals and 
whether the true y value fell below, within, or above each interval. Clearly 
3 method with good frequentist performance will have a coverage proba¬ 
bility of (1 - a) or better, while at the same time producing intervals that 
are not too long on the average. 

Table 4.2 gives the resulting average interval lengths and estimated prob¬ 
abilities that the true 7 lies below, within, or above the computed confi¬ 
dence interval, where we take a = .05 and N = G = 10,000 . Along with 
the exact and approximate frequentist approaches, we also evaluate two 
Bayesian methods, both of which take the two prior means jJi and p? 
equal to 3. However, the first method takes T\ = 7*2 = 1, resulting in a 
highly informative prior. The second instead takes 77 = 77 = 50, a rather 
vague prior specification. The four portions of the table correspond to the 
four true { 61 , 62 ) combinations (0,1), (0,3), (3,1), and (3,3), where < 7 2 = 1 
in all cases. 

Looking at the table, one is first struck by the uniformly poor perfor¬ 
mance of the approximate frequentist method. While it typically produces 
intervals that meet or exceed the desired coverage level of 95%, they are 
unacceptably wide - ludicrously so when 9-> is close to 0. The intervals 
based on the exact frequentist method (derived from solving the quadratic 
equation (4.13)) are much narrower, but in no case do they attain the nom- 
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Simulated Probability 

Average 


Lower 

Central 

Upper 

Length 

(01,02) = (0,1): 
Exact frequentist 

0.1415 

0.7151 

0.1433 

17.8201 

(fail rate = 0.7188) 
Approx frequentist 

0.0000 

0.9998 

0.0002 

31597.4121 

Informative Bayes 

0.4710 

0.5290 

0.0000 

4.8757 

Vague Bayes 

0.0022 

0.9955 

0.0023 

20.5473 

( 0 i , 0 2 ) = (0,3) : 

Exact frequentist 

0.0466 

0.9082 

0.0453 

6.2772 

(fail rate = 0.1072) 
Approx frequentist 

0.0049 

0.9913 

0.0038 

79.3618 

Informative Bayes 

0.5798 

0.4202 

0.0000 

1.3461 

Vague Bayes 

0.0166 

0.9671 

0.0163 

3.8475 

( 0 i , 0 2 ) = (3,1) : 

Exact frequentist 

0.0000 

0.1833 

0.8167 

23.1655 

(fail rate = 0.0828) 
Approx frequentist 

0.0000 

0.8387 

0.1613 

5559.2930 

Informative Bayes 

0.0000 

0.8094 

0.1906 

8.5788 

Vague Bayes 

0.0000 

0.9709 

0.0291 

55.2689 

(01,02) = (3,3) : 
Exact frequentist 

0.0064 

0.8428 

0.1508 

13.5938 

(fail rate = 0.0072) 
Approx frequentist 

0.0000 

0.9408 

0.0592 

25.2528 

Informative Bayes 

0.0016 

0.9962 

0.0022 

1.8507 

Vague Bayes 

0.0024 

0.9733 

0.0243 

8.2376 


Table 4.2 Simulation results, Fielder-Creasy problem 


inal 95% coverage level. This is no doubt attributable in part to the fact 
that the coverage probabilities in the table are of necessity based only on 
those cases where the discriminant D(X\j , X 2 j) was positive; the remaining 
cases ("failures") were simply ignored when computing the summary infor¬ 
mation in the table. With the exception of the final (9 1 , 0 2 ) scenario, these 
fail rates (shown just below the method's label in the leftmost column) are 
quite substantial, suggesting that whatever the method's properties, it will 
often be unavailable in practice. 

By contrast, the performance of the two Bayesian methods is quite im¬ 
pressive. The intervals based on the informative prior have both very high 
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coverage and low average length in the fourth scenario, where the prior's 
high confidence in p 1 = p 2 = 3 is justified, since these are exactly the 
values of 6\ and 0 2 . But naturally, this method suffers in the other cases, 
when the parameter values in which the method is so confident are not 
supported by the data. However, the Bayesian approach employing the 
vague prior always achieves a high level of coverage, and at a reasonable 
(though not overly short) average length. In fact, in two of the four cases 
these intervals are shorter on average than those produced by either of the 
frequentist methods. Thus we see that interval estimates derived from a 
Bayesian point of view and employing only vague (or no) prior information 
can be mechanically easy to compute, and also result in better frequentist 
performance than intervals derived from a classical approach. 

4.4 Empirical Bayes performance 

In Section 3.5 we showed how empirical Bayes (EB) confidence intervals 
may be tuned to have a desired level of empirical Bayes coverage, given an 
assumed "true" prior distribution for 0. In this section we investigate the 
EB coverage of Bayesian confidence sets more generally, evaluating 

Py,»[6 E C(Y)] = EgEY\e[I{$ec(Y)}] 

for various modeling scenarios. We again study the Gaussian/Gaussian 
model, leaving others as exercises. Since EB coverage is the subject of 
interest, we investigate the compound sampling framework, which we recall 
for this problem as 

Yi\9i N(0i, l)p = 1,..., le , and (4.15) 

9i ~ N(p, r 2 ), i = 1,..., Ic , (4.16) 

where for the purpose of illustration we set k = 5. Parallel to our devel¬ 
opment earlier in this chapter, we begin by evaluating the performance of 
Bayes and empirical Bayes point estimators of the Oi in this setting, and 
subsequently investigate the corresponding interval estimates. 

4.4.1 Point estimation 

As discussed in Chapter 3, the traditional approach to estimation does not 
combine the data in the compound sampling setting. More specifically, the 
natural frequentist point estimate based on the likelihood (4.16) would be 



the MLE and UMVUE of . By contrast, the Bayes estimate would be the 
posterior mean Of = Bp + (1 — B)yi, where B=- 1/(1+ r 2 ). Since we must 
specify specific values for p and r 2 in order to compute Of we select the 
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two most obvious choices fi = 0 and r 2 = 1. Thus we have 

Of = w/2 , 

an estimator which shrinks each observation halfway to zero. Note that this 
corresponds to a fairly strong prior belief, since the prior mean is receiving 
the same posterior weight as the data. 

To avoid choosing values for p and r 2 we know we must either estimate 
them from the data or place hyperprior distributions on them. The former 
approach results in the empirical Bayes estimator 

ef B =By+ (l-B) yi , 

where B = 1/(1 +f 2 ) and of course y — \ Yli =i ?A- Here we adopt the usual 
"positive part" estimator for r 2 namely, r 2 = (s 2 — 1) + = max{0, s 2 -11, 
where s 2 = jAy J2^=i(yi ~ V )~- Alternatively, the modified EB estimate 
suggested by Morris (1983) takes the form 

Of = B\i y + (1 - Bm )yi , 

where Bm = [(& — 3)/(Ar — 1)]B. 

As described in Chapter 3, the Morris estimator is an attempt to approx¬ 
imate the hierarchical Bayes rule. But in our normal/normal setting with 
k = 5 and using the flat hyperprior tt(/z, t 2 ) = 1, Berger (1985, p. 187) shows 
that an exact closed form expression can be obtained for the hierarchical 
Bayes point estimator, namely, 

0f B = B*y + (l-B*) yi , 


where 

and 


0(s 2 ) = 


(k — l)s 2 
2 



(k — l)s 2 
2 


- 1 



As in Example 3.3, we used simulation to evaluate the EB properties 
of our five estimators. That is, after choosing "true" values for p and r 2 , 
we sampled 0j = ,..., 6kj) from the prior (4.15) followed by Yj = 

(Yij, . .., Ykj) from the likelihood (4.16), for j = 1,..., N. This resulted 
in N simulated values for each of our five estimators, which we then used 
to estimate the EB risk under squared error loss associated with each. For 
example, for our final estimator we estimated the average risk 
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r 

se(r) 

(H,T 2 ) = (0,1) : 

Frequentist 

1.005 

0.006 

Bayes (B = 0.5) 

0.500 

0.003 

Empirical Bayes 

0.705 

0.005 

Morris EB 

0.750 

0.005 

Hierarchical Bayes 

0.762 

0.005 

(n,T 2 ) = (0,10) : 

Frequentist 

0.994 

0.006 

Bayes (B = 0.5) 

2.736 

0.017 

Empirical Bayes 

0.976 

0.006 

Morris EB 

0.953 

0.006 

Hierarchical Bayes 

0.951 

0.006 

(/r,r 2 ) = (5,1) : 

Frequentist 

1.000 

0.006 

Bayes (B — 0.5) 

6.774 

0.016 

Empirical Bayes 

0.704 

0.005 

Morris EB 

0.746 

0.005 

Hierarchical Bayes 

0.758 

0.005 

(/i,r 2 ) = (5,10): 

Frequentist 

0.987 

0.006 

Bayes (B — 0.5) 

9.053 

0.041 

Empirical Bayes 

0.975 

0.006 

Morris EB 

0.947 

0.006 

Hierarchical Bayes 

0.945 

0.006 


Table 4.3 Simulated EB risks, Gaussian/Gaussian model with k =5 


by the simulated value 

j=l 1=1 

Monte Carlo standard error estimate 


= \ 

Using N = 10, 000 replications, we simulated EB risks for four possi¬ 
ble (/j, t r 2 ) combinations: (0,1), (0,10), (5,1), and (5,10). The results of our 


N K 


rrrEE CT-M 


(Nk)(Nt - 1) 


with associated 
se(f) 
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simulation study are shown in Table 4.3. The fact that the frequentist esti¬ 
mator always has simulated risk near 1 serves as a check on our calculations, 
since 


r(6 ) = 


e 8 e Y\9 




2 — 1 


= E e 
= E e 


i k 


2 — 1 


b k > 


~ i. 


The simple Bayes estimator 6 performs extremely well in the first scenario 
(when the true // and r' 2 are exactly those used by the estimator), but poorly 
when the true prior variance is inflated, and very poorly when the mean 
is shifted as well. As in our Section 4.2 study of frequentist risk (MSE), 
shrinking toward the wrong prior mean (or even overenthusiastic shrinking 
toward the right one) can have disastrous consequences. 

By contrast, the EB performance of all three of the remaining estimators 
is quite impressive, being superior to that of the frequentist estimator when 
there is a high degree of similarity among the (i.e., r 2 is small), and 
no worse otherwise. There is some indication that the hierarchical Bayes 
estimator does better when r~ is large, presumably because these values 
are strongly favored by the flat hyperprior assumed for this estimator. 
Conversely, the uncorrected EB estimator performs better when r 2 is small. 
This is likely due to its tendencv toward overshrinkage- in the two cases 
where r 2 = 1, B was observed to equal 1 (implying () EB = jj) in roughly 
27% of the replications. 


4.4.2 Interval estimation 

In this subsection we continue our investigation of the Gaussian/ Gaussian 
compound sampling model, but in the case of evaluating the EB coverage 
of confidence intervals for 0,. Once again, the frequentist procedure does 
no pooling of the data, resulting in the nominal 100(1 - a )% interval 

Hi ± Z a f 2 , 

where, as usual, z a / 2 is the upper (a/2)-point of the standard normal dis¬ 
tribution. 

For the Bayes interval, the symmetry of the normal posterior distribution 
implies that the HPD interval and equal-tail interval are equivalent in this 
case. Since the posterior has variance (1 - B) — 0.5, we obtain the interval 

Mil 2 ± Zu/ 2 /V '2 , 
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which will always be only 71% as long as the frequentist interval, but will 
carry the risk of being inappropriately centered. 

For the uncorrected ("naive") EB interval, from Section 3.5 we take the 
upper and lower (c*/2)-points of the estimated posterior, obtaining 

ef B ±z ar _\Ji-§, 

where 0f B is as given in the previous subsection. Similarly, the Morris EB 
interval is based on the corrected estimated posterior having mean qM and 
variance 

- yf , 

as given in Subsection 3.5.1. Thus the Morris EB interval is 


e ± z al ^V* . 

Finally, a flat hyperprior for (//, r 2 ) again allows the marginal posterior 
for Oj (and hence, the hyperprior Bayes interval) to emerge in closed form, 
namely, 

ef 8 ± • 

where 0 BB is the same as before, and 

Vi HB = (l-^ B ^+G{yi-S ) 2 ■ 

In this equation, B* is the same as in the previous subsection, and 


r _ 2 (*~ 3 ) 

[(k - l)s 2 ] 2 



(k — l)s 2 
2 


(. B* 



where g(s 2 ) is also the same as before. 

Table 4.4 gives the results of using N = 10, 000 replications to simulate 
EB coverages and interval lengths for the same four (p, r 2 ) combinations 
as presented in Table 4.3. The lower, central, and upper probabilities in 
the table correspond to the proportion of times a simulated 0,- ■ value fell 
below, in, or above the corresponding confidence interval. Each interval 
uses a nominal coverage level of 95%. That the frequentist intervals are 
achieving this coverage to within the accuracy of the simulation serves 
as a check, since the intervals are guaranteed to have the proper level 
of frequentist coverage and thus must also have this coverage level when 
averaged over any proper prior (here, a N(fi, r 2 ) distribution). Next, and 
as we might have guessed from the previous simulation, the Bayes interval 
performs very well in the first case, providing the proper coverage level and 
the narrowest interval of the five. However, in the other three cases (where 
its prior is not supported by the data), its coverage performance ranges 
from very poor to abysmal. 
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Simulated Probability 

Average 


Lower 

Central 

Upper 

Length 

(/i,r 2 ) = (0,1) : 
Frequentist 

0.027 

0.948 

0.025 

3.920 

Bayes (B = 0.5) 

0.026 

0.950 

0.025 

2.772 

Naive EB 

0.186 

0.629 

0.185 

1.964 

Morris EB 

0.019 

0.964 

0.018 

3.605 

Hierarchical Bayes 

0.011 

0.978 

0.011 

4.024 

Ubr 2 ) = (0,10) : 
Frequentist 

0.024 

0.950 

0.025 

3.920 

Bayes (B = 0.5) 

0.198 

0.598 

0.203 

2.772 

Naive EB 

0.045 

0.909 

0.046 

3.535 

Morris EB 

0.025 

0.950 

0.025 

3.850 

Hierarchical Bayes 

0.021 

0.957 

0.022 

3.972 

(/i,r 2 ) = (5. 1) : 
Frequentist 

0.023 

0.951 

0.025 

3.920 

Bayes (B = 0.5) 

0.000 

0.057 

0.943 

2.772 

Naive EB 

0.184 

0.632 

0.184 

1.966 

Morris EB 

0.017 

0.965 

0.018 

3.605 

Hierarchical Bayes 

0.010 

0.980 

0.010 

4.024 

= (5,10) : 
Frequentist 

0.024 

0.952 

0.025 

3.920 

Bayes (B = 0.5) 

0.009 

0.240 

0.751 

2.772 

Naive EB 

0.043 

0.912 

0.045 

3.533 

Morris EB 

0.023 

0.952 

0.025 

3.850 

Hierarchical Bayes 

0.020 

0.959 

0.021 

3.973 


Table 4.4 EB coverages and interval lengths of nominal 95% intervals, Gaus¬ 
sian/Gaussian model with k = 5 


The naive EB interval lives up to its name in Table 4.4, missing the 
nominal coverage level by a small amount when r 2 is large, and by much 
more when r 2 is small. As discussed in Section 3.5, this interval is failing 
to capture the variability introduced by estimating ^ and r 2 . But Morris' 
attempt at correcting this interval fares much better, providing a slightly 
conservative level of coverage and shorter intervals than the frequentist 
method - noticeably so in cases where T 2 is small. Finally, the hierarchical 
Bayes method performs well when r 2 is large, but appears overly conserva¬ 
tive for small r 2 . Apparently if EB coverage with respect to a normal prior 
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is the goal, a hyperprior that places too much mass on large r 2 values will 
result in intervals that are a bit too long. A lighter-tailed hyperprior for r, 
such as an inverse gamma, or the bias correction method for EB confidence 
intervals described in Section 3.5, would likely produce better results. 

4.5 Design of experiments 

When discussing experimental design, many of the controversies surround¬ 
ing Bayesian analysis abate. Most statisticians agree that although different 
approaches to statistical inference engender different design goals, "every¬ 
one is a Bayesian in the design phase," even if only informally. Since no data 
have been collected, all evaluations are preposterior and require integration 
over both the data (a frequentist act) and the parameters (a Bayesian act). 
In the terminology of Rubin (1984), this double integration is a "Bayesianlv 
justifiable frequentist calculation." As such, experimental design shares a 
kinship with the other material in this chapter, and so we include it here. 

Whether the analytic goal is frequentist or Bayes, design can be struc¬ 
tured by a likelihood, a loss function, a class of decision rules, and either 
formal or informal use of a prior distribution. For example, in point estima¬ 
tion under SEL, the frequentist may want to find a sample size sufficiently 
large to control the frequentist risk over a range of parameter values. The 
Bayesian will want a sample size sufficient to control the preposterior risk 
of the Bayes or other rule. Both may want procedures that are robust. 

Though design is not a principal focus of this book, to show the linkage 
between Bayesian and frequentist approaches, we outline basic examples of 
Bayesian design for both frequentist and Bayesian goals. Verdinelli ( 1992) 
provides a recent review of developments in Bayesian design. 


4.5.1 Bayesian design for frequentist analysis 


Though everyone may be a Bayesian in the design stage, the same is most 
assuredly not true regarding analysis. Therefore, designs that posit a prior 
distribution for parameters, but design for a frequentist analysis, have an 
important role. We consider a basic example of finding the sample size 
to achieve desired type I error and power in a classical hypothesis testing 
model. More complicated models are addressed by Carlin and Louis (1985), 
Louis and Bailey (1990) and Shih (1995). 

Consider one-sided hypothesis testing for the mean of n normal variables, 
Yi,. Y n N(0, a 2 ). That is, H 0 : 0 - 0 versus H a : 9 > O. Standard 
computations show that for a fixed d, the n that achieves Type I error a 
and power 1 - (3 satisfies 


$ 



= 1 -/ 3 , 


(4.17) 
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where $ is the standard normal cdf and, as usual, z a is the upper o-point 
of this distribution. The standard frequentist design produces 


n$ 


! (-3 + 


(4.18) 


Since the required n depends on 0, a frequentist design will typically be 
based on either a typical or a conservative value of 9. However, the former 
can produce too small a sample size by underrepresenting uncertainty in 6, 
while the latter can be inefficient. 

Now consider the Bayesian design for this frequentist analysis. Given a 
prior g(6 ) for the treatment effect, Spiegelhalter (2000) suggests a simple 

Monte Carlo approach wherein we draw 6j g(0j ),j = 1.V, and sub¬ 

sequently apply formula (4.17) to each outcome. This produces a collection 
of sample sizes n* e •, the distribution of which reflects the uncertainty in the 
optimal sample size resulting from prior uncertainty. Adding a prior for <j 2 
as well creates no further complication. 

A slightly more formal approach would be to set the expected power under 
the chosen prior g equal to the desired level, and solve for n, i.e.. 



(4.19) 


Shih (1995) applies a generalization of this approach to clinical trial design 
when the endpoint is the time to an event. Prior distributions can be placed 
on the treatment effect, event rates, and other inputs. 

Solving for h in (4.19) requires interval-halving or some other recursive 
approach. However, if G is A '(pi. r 2 ). h satisfies 



Z 3 \/l + hr 2 . 


(4.20) 


For all aand 0 , n>ng, and if r = 0. then h — n$. 

Formulas (4.17) and (4.19) produce a sample size that ensures expected 
power is sufficiently high. Alternatively, one could find a sample size that, 
for example, makes P[ power > 3 | n] > - . 

Example 4.1 Consider comparison of two Poisson distributions, one with 
rate A and the other with rate p. Assume interest lies in the parameter 6 = 
X/fi or a monotone transformation thereof. With data S n = X\ + ■ ■ ■ + X n 

and T n = V) 4-(- Y n , respectively, W n = S n + T n is ancillary for 6 (and 

sufficient for rj = A + fi). 

Irrespective of whether one is a Bayesian or frequentist, inference can be 
based on the conditional distribution of S n given W„, 

S n | W n ~ Bin(W n , 6), 

where 9 = 6/(6 + 1). For this conditional model, W n is the sample size and 
there is no fixed value of n that guarantees W n will be sufficiently large. 
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However. Bayesian design for Bayesian analysis puts priors on both 6 and 
7 and finds a sample size that controls preposterior performance. Bayesian 
design for frequentist analysis puts a prior on r) and finds a sample size n 
which produces acceptable frequentist performance in the conditional bino¬ 
mial distribution with parameters (W n , 9) averaged over the distribution of 
B n or otherwise using the distribution of W n to pick a sample size. Carlin 
and Louis (1985) and Louis and Bailey (1990) apply these ideas to reducing 
problems of multiple comparisons in the analysis of rodent tumorigenicity 
experiments. 

If sequential stopping is a design option, one can stop at the first n 
such that W n gives acceptable frequentist or Bayesian performance. The 
prior on T] allows computation of the distribution of J\f. ■ 


4.5.2 Bayesian design for Bayesian analysis 

Having decided that Bayesian methods are appropriate for use at the design 
stage, it becomes natural to consider them for the analysis stage as well. 
But in this case, we need to keep this new analysis plan in mind as we 
design the experiment. Consider for instance estimation under SEL for the 
model 

Y u ...,Y n \e ii? f{y\9), 

where 9 has prior distribution G, and f is an exponential family with 
sufficient statistic S n = Yi + • • • + Y n . The Bayes estimate is then 9{S n ) — 
Eg(9 | S n ), with posterior and preposterior risk given by 

p{G,S n ) = V g (9 | Sn) and r(G, S n ) = E y [V G (0 \ S n )] . 

where this last expectation is taken with respect to the marginal distribu¬ 
tion of y. We begin with the following simple design goal: find the smallest 
sample size n such that 

r{G,S n )<C 2 . (4.21) 

For many models, the calculations can often be accomplished in closed 
form, as the following two examples illustrate. 

Example 4.2 Consider first the normal/normal model of Example 2.2, 
where both G and/ are normal with variances r 2 and <x 2 , respectively. Then 
p(G,S n ) = (1 — B n )a 2 /n, where B n = <r 2 /[cr 2 + nr 2 ]. This risk depends 
neither on the prior mean g nor the observed data S n Thus p(G. S n ) = 
r(G, S n ), and (4.21) produces 



As expected, increasing C decreases n, with the opposite relation for r. As 
T —► oo, n —► (§-) 2 , the usual frequentist design. ■ 


© 2000 by CRC Press LLC 



Example 4.3 Now consider the Poisson/gamma model of Example 2.5, 
where we assume that G is gamma with mean p and variance / i 2 /a, and 
/(■ | 9) is Po(6). Then 


p(G,S n ) 


and r(G,S n ) 
Plugging this into (4.21), we obtain 


Q + S n 

(n + a/p) 2 ' 


a + nil 


n 




i 



For this model, the optimal Bayesian design depends on both the prior 
mean and variance. As a —* 0, the prior variance increases to infinity and 
n —+ p/C 2 , which is the frequentist design for 0 = p. ■ 


Using preposterior Bayes risk as the design criterion produces a suffi¬ 
cient sample size for controlling the double integral over the sample and 
parameter spaces, integrating first over the sample space represents the 
preposterior risk as the frequentist risk R( 9, S„.) averaged over the param¬ 
eter space. The Bayesian formalism brings in uncertainty, so that generally 
the sample size will be larger than that obtained by a frequentist design 
based on a "typical" (i.e. expected) parameter value, but perhaps smaller 
than that based on a "conservative" value. 

A fully Bayesian decision-theoretic approach to the design problem in¬ 
vokes the notions of expected utility and the cost of sampling. While many 
of the associated theoretical considerations are beyond the scope of our 
book, the basic ideas are not difficult to explain or understand; our discus¬ 
sion here follows that in Muller and Parmigiani (1995). Suppose U(n, y, 6) is 
the payoff from an experiment having sample size n, data y = (?/i,.. ., y n )' 
and parameter 0 , while C(n, y, 0 is the cost associated with this experi¬ 
ment. A reasonable choice for U(n, y, 6 ) might be the negative of the loss 
function; i.e., under SEL we would have U(n, y, 9) = —(9 n (y) — 9 ) 2 . Cost 
might be measured simply on a per sample basis, i.e., C(n, y, 9) = cn for 
some c > 0. Then a plausible way of choosing the best sample size n would 
be to maximize the expected utility associated with this experiment, 

U(n) = J J[U(n,y,6)-C(n,y,9)]f(y\0)g(0)dyd9 , (4.22) 

where, as usual, f(y\9) is the likelihood and g(0) the chosen prior. That is, 
we are choosing the design which maximizes the net payoff we expect using 
this sample size n, where as before we take the preposterior expectation 
over both the parameter 9 and the (as yet unobserved) data y. Provided 
the per sample cost c > 0, U(n) will typically be concave down, with 
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expected utility at first increasing in n but eventually declining as the 
cost of continued sampling becomes prohibitive. Of course, since it maybe 
difficult to assess c in practice, one might instead simply set c = 0, meaning 
that U(n) would increase without bound. \\'e could then select n as the 
first sample size that delivers a given utility, equivalent to the constrained 
minimization in condition (4.21) above when l (n.y. 9) — —1(6. 9 n ( y)). 

Another difficulty is that one or both of the integrals in (4.22) may be 
intractable and hence require numerical integration methods. Fortunately, 
Monte Carlo draws are readily available by simple composition: similar 
to the process outlined in Figure 3.3. we would repeatedly draw 0j from 
g(6) followed by yj from /(y|0j). producing a joint preposterior sample 
{((h, y*),J = 1,-.. N }. The associated Monte Carlo estimate of U(n) is 

1 - v 

K(n) = — 5jnn.yj.0J) - C(n. yj. 6 *)] . (4.23) 

j = i 

Repeating this calculation over a grid of n values { n t i =1. /}, we 

obtain a point wise Monte Carlo estimate of the true curve. The optimal 
sample size n is the one for which U(n,) is a max. 

When exact pointwise evaluation of U(rii) is prohibitively time consum¬ 
ing due to the size of I or the dimension of (y, 9), Miiller and Parmi- 
giam (1995) recommend simply drawing a single (t/j, 6 ’) from the joint 
distribution for a randomly selected design n,, and plotting the integrands 
Ui = U{n. yj, ()*) - C(n, yj, 6‘) versus n,-. Using either traditional para¬ 
metric (e.g., nonlinear regression) or nonparametric (e.g.. loess smoothing) 
methods, we may then fit a smooth curve U(n) to the resulting point cloud, 
and finally take h = argmax U(n). Notice that by ignoring the integrals in 
(4.22) we are essentially using a Monte Carlo sample size of just N = 1. 
However, results are obtained much faster, due to the dramatic reduction in 
function evaluations, and may even be more accurate in higher dimensions, 
where the "brute force" Monte Carlo approach (4.23) may not produce a 
curve smooth enough for a deterministic optimization subroutine. 

Example 4.4 Muller and Parmigiani (1995) consider optimal selection of 
the sample size for a binomial experiment. Let f(y\9) by the usual binomial 
likelihood (”)0 S (1 — 9) n ~ y and p(0) be the chosen prior - say, a Unif( 0, 1) 
(a noninformative choice). If we adopt absolute error loss (AEL), an ap¬ 
propriate choice of payoff function is 

U(n, y,0) = -\m y -6\ , 

where m. y is the posterior median, the Bayes rule under AEL; see Ap¬ 
pendix B, problem 3. Adopting a sampling cost of .0008 per observation. 
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(4.22) then becomes 


U(n) = 


-1 n 

l 

Jo y=o 


|my -9\- .0008n] ( )0»(1 - 9) n ~ y dO . (4.24) 


Here the y sum is over a finite range, so Monte Carlo integration is needed 
only for 9. We thus set up a grid of n,- values from 0 to 120, and for each. 


draw 9j 


iid 


Unif(0, 1) j = I.... N, and obtain 


*?(-) = ^ ED-K - r,\- .ooosn^ W<i - err -*, (4.25) 

v j-0y=0 ' 

where m y can be derived numerically from the Beia(y + 1. n, — y + 1) 
posterior. Again, h — argmax%). 

Miiller and Parmigiam (1995) actually note that in this case the 9 integral 
can be done in closed form after interchanging the order of the integral and 
the sum in (4.24), so Monte Carlo methods are not actually needed at all. 
Alternatively, the quick estimate using N = 1 and smoothing the resulting 
(rii, Ui ) point cloud performs quite well in this example. ■ 


In summary, our intent in this chapter has been to show that procedures 
derived using Bayesian machinery and vague priors typically enjoy good 
frequentist and EB (preposterior) risk properties as well. The Bayesian ap¬ 
proach also allows the proper propagation of uncertainty throughout the 
various stages of the model, and provides a convenient roadmap for obtain¬ 
ing estimates in complicated models. By contrast, frequentist approaches 
tend to rely on clever "tricks," such as sufficient statistics and pivotal quan¬ 
tities, that are often precluded in complex or high-dimensional models. All 
of this suggests the use of Bayesian methodology as an engine for producing 
procedures that will work well in a variety of situations. 

However, there is an obvious problem with this "Bayes as procedure en¬ 
gine" school of thought: our ability (or lack thereof) to obtain posterior dis¬ 
tributions in closed form. Fortunately, the recent widespread availability of 
high-speed computers combined with the simultaneous rapid methodologi¬ 
cal development in integration via approximate and Monte Carlo methods 
has provided an answer to this question. These modern computational ap¬ 
proaches, along with guidance on their proper use and several illustrative 
examples, are the subject of the next chapter. 


4.6 Exercises 

1. In the bivariate Gaussian example of Subsection 4.1.2, show that if £ = 
I, then B has the form (4.2). 

2. In the beta/binomial point estimation setting of Subsection 4.2.2, 
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(a) For /i = .5 and general n and M, find the region where the Bayes rule 
has smaller risk than the MLE. 


(b) Find the Bayes rule when the loss function is 

(a - ef 


L(9, a) = 


9(1-9) 


3. In the beta/binomial interval estimation setting of Subsection 4.3.1, 

(a) Under what condition on the prior will the HPD interval for 9 be 
one-sided when X = n? Find this interval. 

(b) Outline a specific interval-halving algorithm to find the HPD interval 
(9l,9jj)- Use your algorithm to find the 959 ? HPD interval when 
a = b = 1, n = 5, and x = 1- What percentage reduction in width 
does the HPD interval offer over the equal tail interval in this case? 


4. Repeat the analysis plan presented for the beta/binomial in Subsec¬ 
tion 4.3.1 for the Gaussian/Gaussian model. 


(a) Show that without loss of generality you can take the prior mean (fi 
= 0) and the sampling variance cr =1). 

(b) Show that the highest posterior probability interval is also equal¬ 
tailed. 

(c) Evaluate the frequentist coverage for combinations of the prior vari¬ 
ance (r 2 ) and the parameter of interest (6). 

(d) Also, evaluate the Bayesian posterior coverage and pre-posterior cov¬ 
erage of the interval based on (1 = 0, when the true prior mean is not 
equal to 0. 

( Hint: All probabilities can be represented as Gaussian integrals, and 
the interval endpoints come directly from the Gaussian cumulative 
distribution function.) 

5. Show that if the sampling variance is unknown and has an inverse gamma 
prior distribution, then in the limit as information in this inverse gamma 
goes to 0 and as r 2 —* oo, Bayesian intervals are Student's f-intervals. 

6. Actually carry out the analysis in Example 4.4, where the prior g(6) in 
(4.22) is 

(a) the Uni f (0,1) prior used in the example, g(9) = 1; 

(b) a mixture of two Beta priors, g(0) = .5 • W 2 + .5 • 30# 2 (1 — 9 2 ). 

What is the optimal sample size n in each case? Is either prior amenable 
to a fully analytical solution (i.e, without resort to Monte Carlo sam¬ 
pling)? 

7. Using equations (4.18) and (4.20), verify that h > ng. Will this relation 
hold for all G? (Hint: see equation (4.19)). 
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CHAPTER 5 


Bayesian computation 


5.1 Introduction 

As discussed in Chapter 2, determination of posterior distributions comes 
down to the evaluation of complex, often high-dimensional integrals (i.e., 
the denominator of expression (2.1)). In addition, posterior summarization 
often involves computing moments or quantiles, which leads to more in¬ 
tegration (i.e., now integrating the numerator of expression (2.1)). Some 
assistance was provided in Subsection 2.2.2, which described how conju¬ 
gate prior forms may often be found that enable at least partial analytic 
evaluation of these integrals. Still, in all but the simplest model settings 
(typically linear models with normal likelihoods), some intractable integra¬ 
tions remain. 

The earliest solution to this problem involved using asymptotic methods 
to obtain analytic approximations to the posterior density. The simplest 
such result is to use a normal approximation to the posterior, essentially a 
Bayesian version of the Central Limit Theorem. More complicated asymp¬ 
totic techniques, such as Laplace's method (Tierney and Kadane, 1986), en¬ 
able more accurate, possibly asymmetric posterior approximations. These 
methods are the subject of Subsection 5.2. 

When approximate methods are intractable or result in insufficient accu¬ 
racy, we must resort to numerical integration. For many years, traditional 
methods such as Gaussian quadrature were the methods of choice for this 
purpose; see Thisted (1988, Chapter 5) for a general discussion of these 
methods, and Naylor and Smith (1982) for their application to posterior 
computation. But their application was limited to models of low dimension 
(say, up to 10 parameters), since the necessary formulae become intractable 
for larger models. They also suffer from the so-called "curse of dimension¬ 
ality," namely, the fact that the number of function evaluations required 
to maintain a given accuracy level depends exponentially on the number 
of dimensions. An alternative is provided by the expectation-maximization 
(EM) algorithm of Dempster, Laird, and Rubin (1977), previously discussed 
in Section 3.4. Originally developed as a tool for finding MLEs in missing 
data problems, it applies equally well to finding the mode of a posterior 
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distribution in many Bayesian settings. Indeed, much of the subsequent 
work broadening the scope of the algorithm to provide standard error esti¬ 
mates and other relevant summary information was motivated by Bayesian 
applications. 

While the EM algorithm has seen enormous practical application, it is 
a fairly slow algorithm, and remains primarily geared toward finding the 
posterior mode, rather than estimation of the whole posterior distribu¬ 
tion. As a result, most applied Bayesians have turned to Monte Carlo 
integration methods, which provide more complete information and are 
comparatively easy to program, even for very high-dimensional models. 
These methods include traditional noniterative methods such as impor¬ 
tance sampling (Geweke, 1989) and simple rejection sampling, and are the 
subject of Section 5.3. Even more powerful are the more recently developed 
iterative Monte Carlo methods, including data augmentation (Tanner and 
Wong, 1987), the Metropolis-Hastings algorithm (Hastings, 1970). and the 
Gibbs sampler (Geman and Geman, 1984; Gelfand and Smith. 1990). These 
methods produce a Markov chain, the output of which corresponds to a 
(correlated) sample from the joint posterior distribution. These methods, 
along with methods useful in assessing their convergence, are presented in 
Section 5.4. While the discussion in this final section is substantial (by far 
the longest in the chapter), space does not permit a full description of this 
enormous, rapidly growing area. A wealth of additional material may be 
found in the recent textbooks by Gilks et al. (1996), Gamerman (1997). 
Robert and Casella (1999), and Chen et al. (2000). 

5.2 Asymptotic methods 

5.2.1 Normal approximation 

When the number of datapoints is fairly large, the likelihood will be quite 
peaked, and small changes in the prior will have little effect on the resulting 
posterior distribution. Some authors refer to this as "stable estimation." 
with the likelihood concentrated in a small region SI C 0 where rr(9) is 
nearly constant. In this situation, the following theorem shows that p(#|x) 
will be approximately normal. 

Theorem 5.1 Suppose that .Vj,..., X n ~ /,- {x t \0), and thus /(x|0) = 
nr = i Suppose the prior tt (0) and f(x\0) are positive, and twice 

7T 

differentiable near 9 the posterior mode (or "generalized MLE") of 6, 
assumed to exist. Then under suitable regularity conditions, the posterior 
distribution p(0|x) for large n can be approximated by a normal distribu¬ 
tion having mean equal to the posterior mode, and covariance matrix equal 
to minus the inverse Hessian (second derivative matrix) of the log poste¬ 
rior evaluated at the mode. This matrix is sometimes notated as [Z’ r (x)] — 1 , 
since it is the "generalized" observed Fisher information matrix for 9. More 
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specifically. 


%(*) = - 


d 2 


89 ; 89 , 


■log(/(x|0)x(0)) 


0=0 


Other forms of the normal approximation are occasionally used. For in¬ 
stance. if the prior is reasonably flat we might ignore it in the above cal¬ 
culations. This in effect replaces the posterior mode 9 W by the MLE 9 , and 
the generalized observed Fisher information matrix by the usual observed 
Fisher information matrix, I(x), where 


Iij(x) 


8 2 


89; 89, 


log /(x|0) 


6=0 


£ 

1=1 


8 2 


89; 89, 


logfi(x,\9) 


J 6=6 


Alternatively, we might replace the posterior mode by the posterior mean 
/F(x), and replace the variance estimate based on observed Fisher infor¬ 
mation with the posterior covariance matrix I/’ r (x)), or even the expected 
Fisher information matrix 1(9), which in the case of lid sampling is given 
by 


lij(0) — n ^x 1 \0 




The options listed in Theorem 5.1 are the most commonly used, however, 
since adding the prior distribution ir(6) complicates matters little, and 
the remaining options (e.g., computing a mean instead of a mode) involve 
integrations - the very act the normal approximation was designed to avoid. 

Just as the well-known Central Limit Theorem enables a broad range 
of frequentist inference by showing that the sampling distribution of the 
sample mean X is asymptotically normal, Theorem 5.1 enables a broad 
range of Bayesian inference by showing that the posterior distribution of 
a continuous parameter 0 is also asymptotically normal. For this reason. 
Theorem 5.1 is sometimes referred to as the Bayesian Central Limit The¬ 
orem. We do not attempt a general proof of the theorem here, but provide 
an outline in the unidimensional case. Write 1(6) — log f(x\9)ir(6) and use 
a Taylor expansion of 1(9) as follows: 


p(9\x) oc 


OC 


f(x\6)tr(0) = e lo sAxl*K») _ 


exp | l(6)- l -F(x)(6-9) 2 

N(9\ [Fix)]- 1 ) . 


e m 

( 0 - fl ) 2 
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The second term in the Taylor expansion is zero, since the log posterior has 
slope zero at its mode, while the third term gives rise to the generalized 
Fisher information matrix, completing the proof. ■ 

Example 5.1 Consider again the beta/binomial example of Section 2.3.4. 
Using a flat prior on 9, we have f(x\O)n(0) oc 9 X ( 1 — 9) n ~ T so that 1(9) = 
x log# + (n — x) log(l — 9). Taking the derivative of 1(9) and equating to 
zero, we obtain 6 W = x/n, the familiar binomial proportion. The second 
derivative is 

d 2 l(9) _ — x n — x 
09' 1 = lP~ (1 - 9) 2 ’ 

so that 

d 2 l(9) x n — x _ n n 

dP t=i * (l-#) 2 1-#' 

Thus [T 71 "(a?)] -1 = (f + yt?) = (fob)) = eJ ^’ which is the USUal 
frequentist expression for the variance of 9 = x/n. Thus, 


7t(#|a;) ~ N 


{*■ *¥>) 


Figure 5.1 shows this normal approximation to the posterior of # for the 
consumer preference data, previously analyzed in Section 2.3.4. Also shown 
is the exact posterior distribution, which we recall under the flat prior on 
0 is a Bend 14, 4) distribution. Notice that the two distributions have very 
similar modes, but that the approximate posterior density fails to capture 
the left skew in the exact posterior; it does not go to 0 at 9 = 1, and drops 
off too quickly as 0 decreases from § = 13/16 = .8125. This inaccuracy is 
also reflected in the corresponding equal tail credible sets: while the exact 
interval is (.57, .93), the one based on the normal distribution is (.62, 1.0). 
Had the observed x been even larger, the upper limit for the latter interval 
would have been greater than 1, a logical impossibility in this problem. 
Still, the difference is fairly subtle, especially given the small sample size 
involved (only 16). ■ 


5.2.2 Laplace's method 

The posterior approximation techniques of the previous subsection are of¬ 
ten referred to as first order approximations or modal approximations. Es¬ 
timates of posterior moments and quantiles may be obtained simply as the 
corresponding features of the approximating normal density. But as Fig¬ 
ure 5.1 shows, these estimates may be poor if the true posterior differs 
significantly from normality. This raises the question of whether we can 
obtain more accurate posterior estimates without significantly more effort 
in terms of higher order derivatives or complicated transformations. 
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Figure 5.1 Exact and approximate posterior distributions for Q consumer pref¬ 
erence data example. 

An affirmative answer to this question is provided by clever applica¬ 
tion of an expansion technique known as Laplace's Method. This technique 
draws its inspiration from work originally published in 1774 by Laplace 
(reprinted 1986); the most frequently cited paper on the subject is the 
rather more recent one by Tierney and Kadane (1986). The following ab¬ 
breviated description follows the presentation given by Kass, Tierney, and 
Kadane (1988). 

Suppose f is a smooth positive function of 0, and h is a smooth function 
of 0 with -h having unique maximum at 0. Again taking 0 to be univariate 
for simplicity, we wish to approximate 

1 = J f(6)e- nhi ^d.0 . 

Using Taylor series expansions of both f and li about 0 t we obtain 

x exp —nh(0) - nh'(0)(6 - 9) - nh"(6) — ^ d9 
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= e 


-nh(B ) 


/ 


f(e) + f(e)(9-6) + ^(e-9) 2 


x exp 


(e-ef 


2 (nh"(6j) 


-l 


de 


since h'{9) = 0 by the definition of 6. But now notice that the final, mul¬ 
tiplicative term inside the integral is proportional to a N(6. (nh'^O)) -1 ) 
density for 9, so that the second term inside the brackets vanishes since 
E(0 — 9) = 0. Thus we have 


I ss e 


nh (0) f(§\ 




where Var($) = [nh"(9)} 1 = 0(2-). In general, one can show for tri¬ 
dimensional 9 that / = /{l + 0(i)}, where 

/o \ 

I = J |£] 1/2 exp (-nh{9)} (5.1) 


and E = [D 2 h(9)]~ 1 . That is, 7 is a first order approximation to / in an 
asymptotic sense. 

Now, suppose we wish to compute the posterior expectation of a function 
g(9), where we think of - nh(9) as the unnormalized posterior density. 
log[/(y|0)7r(0)]. Then 


E(g{6)) = 


f g(9)e~ nh ^d9 
f e ~ nh (Q)d6 


(5.2) 


We may apply Laplace's method (5.1) to the numerator if = g) and de¬ 
nominator if — 1) to get 


E(g(9)) 


g{9) Cf) m/2 ISI^expt-nMA)) [l + Oi (j)] 

(£) m/2 |E| 1/2 exp(- nh(9))[l + 0 2 (^)} 

(q\ f 1 + °2 (^) - 0 2 (p) + Oi (£) 

m TkmJ) 


9(0) 



MS) 

i + o 2 Q) 


9(0) 


1 + 0 



showing that g{9 ) is a first order approximation to the posterior mean of 
g(9). Hence this naive application of Laplace's method produces the same 
estimator (with the same order of accuracy) as the modal approximation. 
But consider the following trick, introduced by Tierney and Kadane (1986): 
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provided g(0) > 0 for all 0, write the numerator integrand in (5.2) as 

glog g(0)-nh{0) _ e ~nh’[0) 


and now use Laplace's method (5.1) with /' = 1 in both the numerator and 
denominator. The result is 


E{g(S)) = 


exp(-nh*(9')) 
152| x / 2 exp(—nh(0)) 


1 + 0 - 

V n- 


where 0* is max of -/i* and £* = [D 2 h*(0 r )]~ 1 . similar to the defini¬ 
tions for 0 and £ above. This clever application of Laplace's method thus 
provides a second order approximation to the posterior mean of g(0). but 
still requires the calculation of only first and second derivatives of the log- 
posterior (although the presence of both numerator and denominator in 
(5.3) means we do need exactly twice as many of these calculations as we 
would to produce the simple normal approximation). By contrast, most 
other second-order expansion methods require computation of third- and 
perhaps higher-order derivatives. The improvement in accuracy in (5.3) 
comes since the leading terms in the two errors (both O (~) are identical, 
and thus cancel when the ratio is taken (see the appendix of Tierney and 
Kadane, 1986, for the details of this argument). 

For functions g(0) which are not strictly positive, a simple solution is 
to add a large constant c to g(0 ), apply Laplace's method (5.3) to this 
function, and finally subtract c to obtain the final answer. Alternatively, 
Tierney, Kass, and Kadane (1989) recommend applying Laplace's method 
to approximate the moment generating function L(exp[.s</(0)]). whose inte¬ 
grand is always positive, and then differentiating the result. These authors 
also show that these two approaches are asymptotically equivalent as c 
tends to infinity. 

Improved approximation of marginal densities is also possible using this 
method. Suppose 0 = ©j x 02, and we want an estimate of 


P{0 1 |y) 


q>[/(0i,0 2 )]d0 2 


where 1(0) = log(/(y|0)7r(0)) and c is an unknown normalizing constant. 
Taking / = 1 and h = n~ 1 l, we may use (5.3) again to get 

m |y) = c*|£(0i)| 1/ Vxp[^ 1 ,0o(0 1 ))] , (5.4) 


where c* is a new normalizing constant, 0 2 (^t) is th e maximum of 1(9 1 , •), 
and £((?!) is the inverse of the Flessian of -1(0 \ , •) at 0 2 (0i ). In order to 
plot this density, formula (5.4) would need to be evaluated over a grid 
of 0] values, after which the normalizing constant c could be estimated. 
This approximation has relative error of order 0(n~ 3 ^ 2 ) on neighborhoods 
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about the mode that shrink at the rate n ^ 2 . (By contrast, the normal 
approximation has error of order 0(n~ l / 2 ) on these same neighborhoods.) 

There are several advantages to Laplace's method. First, it is a compu¬ 
tationally quick procedure, since it does not require an iterative algorithm. 
It replaces numerical integration with numerical differentiation, which is 
often easier and more stable numerically. It is a deterministic algorithm 
(i.e.. it does not rely on random numbers), so two different analysts should 
be able to produce a common answer given the same dataset, model, and 
prior distribution. Finally, it greatly reduces the computational complex¬ 
ity in any study of robustness, i.e., an investigation of how sensitive our 
conclusions are to modest changes in the prior or likelihood function. For 
example, suppose we wish to find the posterior expectation of a function 
of interest g(9) under a new prior distribution, 7 r new(9)- We can write 


ENEw[g{6) |y] = 


f g(9)f(y\9)ir(9)b{9)d9 
f f(y\e)w(d)b(e)d9 


where b(-) is the appropriate perturbation function, which in this case is 
b(9) — Knew {9) / k(9). We can avoid starting the posterior calculations 
from scratch by using a result due to Kass, Tierney, and Kadane (1989), 
namely, 

E NE w[g(9)\y]K b 4V-E[g{9)\y], (5.5) 


where 9* and 9 maximize log[</(0)/(y|0)tr(0)] and log[/(y|0)tr(0)]. respec¬ 
tively. Note that if equation (5.3) was used to compute the original pos¬ 
terior expectation -Efg^jly], then 9* and 9 are already available. Hence 
any number of alternate priors may be investigated simply- by evaluating 
the ratios b(9*)/b(9); no new derivative calculations are required. A some¬ 
what more refined alternative is to obtain the original posterior expectation 
precisely using a more costly numerical quadrature or Monte Carlo integra¬ 
tion method, and then use equation (5.5) to carry out a quick sensitivity 
analysis. We shall return to this issue in Subsection 6.1.1. 

The Laplace method also has several limitations, however. For the ap¬ 
proximation to be valid, the posterior distribution must be unimodal, or 
nearly so. Its accuracy also depends on the parameterization used (e.g., 
9 versus log(#)), and the correct one may be difficult to ascertain. Simi¬ 
larly, the size of the dataset, n, must be fairly large; moreover, it is hard 
to judge how large is "large enough.” That is, the second-order accuracy 
provided by equation (5.3) is comforting, and is surely better than the 
first-order accuracy provided by the normal approximation, but, for any 
given dataset, we will still lack a numerical measurement of how far our 
approximate posterior expectations are from their exact values. Worse yet, 
since the asymptotics are in n, the size of the dataset, we will not be able to 
improve the accuracy of our approximations without collecting additional 
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data! Finally, for moderate- to high-dimensional 0 (say, bigger than 10), 
Laplace's method will rarely be of sufficient accuracy, and numerical com¬ 
putation of the associated Hessian matrices will be prohibitively difficult 
anyway. Unfortunately, high-dimensional models (such as random effects 
models that feature a parameter for every subject in the study) are fast 
becoming the norm in many branches of applied statistics. 

For these reasons, practitioners now often turn to iterative methods, 
especially those involving Monte Carlo sampling, for posterior calculation. 
These methods typically involve longer runtimes, but can be applied more 
generally and are often easier to program. As such, they are the subject of 
the remaining sections in this chapter. 

5.3 Noniterative Monte Carlo methods 

5.3.1 Direct sampling 

We begin with the most basic definition of Monte Carlo integration, found 
in many calculus texts. Suppose 9 ~ h(9) and we seek 7 = E[f(6)] — 

f f(9)h(9)d0. Then if 9 1 ,..., 0 N h(6), we have 

1 N 

T=^EM). (5.6) 

i = i 

which converges to E[f(6)] with probability 1 as N —> 00 , by the Strong 
Law of Large Numbers. In our case, h($) is a posterior distribution and 7 
is the posterior mean of f(0). Hence the computation of posterior expec¬ 
tations requires only a sample of size N from the posterior distribution. 

Notice that, in contrast to the methods of Section 5.2, the quality of the 
approximation in (5.6) improves as we increase N, the Monte Carlo sample 
size (which we choose), rather than n, the size of the dataset (which is 
typically beyond our control). Another contrast with asymptotic methods 
is that the structure of (5.6) also allows us to evaluate its accuracy for any 
fixed N. Since 7 is itself a sample mean of independent observations, we 
have that Var( 7 ) = Var[f(9)\/N. But Var[f(9)] can be estimated by the 
sample variance of the /(<?,) values, so that a standard error estimate for 
7 is given by 

sHl) = \ N(N — 1) (5 - 7) 

Finally, the Central Limit Theorem implies that 7 ± 2 se(~f) provides an 
approximate 95% confidence interval for the true value of the posterior 
mean 7 . Again, N may be chosen as large as necessary to provide any 
preset level of narrowness to this interval. While it may seem strange for 
a practical Bayesian textbook to recommend use of a frequentist interval 
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estimation procedure. Monte Carlo simulations provide one (and perhaps 
the only) example where they are clearly appropriate! 

In addition, letting I( a ,b) denote the indicator function of the set (a, b), 
note that 

p=P{a< f{0) < b\y} = £{/(„,»)[/(*)] |y} ■ 
so that an estimate of p is available simply as 

number of f(0j )s E (a, b) 

P= N ’ 

with associated binomial standard error estimate y/p( 1 — P)/N. In fact, this 
suggests that a histogram of the sampled 0 ;s would estimate the posterior 
itself, since the probability in each histogram bin converges to the true 
bin probability. Alternatively, we could use a kernel density estimate to 
"smooth" the histogram. 


N 


P(%) = 


Nh 


N 


E* 




h N 


where AT is a "kernel" density (typically a normal or rectangular distribu¬ 
tion) and /ijv is a window width satisfying h y —^ 0 and Nhff —> oo as 
N — oo. The interested reader is referred to the excellent book by Silver- 
man (1986) for more on this and other methods of density estimation. 

Example 5.2 Let Yj ~ N(p, cr 2 ),i = 1, . . .,n, and suppose we adopt the 
reference prior n(p, <r) — Then one can show (see Berger (1985, problem 
4.12), or Lee (1997, Section 2.12)) that the joint posterior of p and a~ is 
given by 

p|<7 2 , y ~ N{y,a 2 /n), 

and a 2 \y ~ Ky~h - K ■ IG 

where K = 2?=! (v* — .v ) 2 ■ We may thus generate samples from the joint 
posterior quite easily as follows. First sample crj ~ A’Xn-ii anc ^ then sample 
(ij ~ N(y, cr?/n), j = 1, . . ., N. This then creates the set {(pj. aj), j ~ 
1, . . Nj from p(p, <r 2 |y). To estimate the posterior mean of p, we would 
use 

1 N 

E(p\y) = y ■ 

3 = 1 

To obtain a 95% equal-tail posterior credible set for p,, we might simply 
use the empirical .025 and .975 quantiles of the sample of pj values. 

Estimates of functions of the parameters are also easily obtained. For 
example, suppose we seek an estimate of the distribution of 7 = o'/p, 
the coefficient of variation. We simply define the transformed Monte Carlo 
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samples 7 j — aj / pj, j = 1, . . N, and create a histogram or kernel density 
estimate based on these values. 

As a final illustration, suppose we wish to estimate P(Y n+ \ > cly), where 
Y„ + i is a new observation, not part of y. Writing 8 = (//. rr 2 ), we have 


P(Y n +i > c|y) = / I i c,o 0 )(yn+i)f{yn+i\y)dy n +i 

= J j I(c,oo)(y n +i)f(y n +i\0)p(0\y)d6dy n 


+1 


I(c t 00 ) (j/n + 1 )/(Vn + 1 \0)dy n + l 


p(8\y)d,8, 


the second equality coming from the fact that f(y n +i |0.y) = f(y n +i\0), 
since the V'-s are conditionally independent given 8. But now the quantity 
inside the brackets in the third line is nothing but P(Y ' n+1 > c\8) = 1 - 
$((c — p)/cr), where $ is the cdf of a standard normal distribution. Hence 

1 N 

p (Yn+i > c|y) « “ ®(( c ~ Pi)/ a i)] • 

since the 8j p(8 |y). ■ 


5.3.2 Indirect methods 

Example 5.2 suggests that given a sample from the posterior distribution, 
almost any quantity of interest can be estimated. But what if we can't 
directly sample from this distribution? This is an old problem that predates 
its interest by Bayesian statisticians by many years. As a result, there are 
several approaches one might try, of which we shall discuss only three: 
importance sampling, rejection sampling, and the weighted bootstrap. 


Importance sampling 


This approach is outlined carefully by Hammersley and Handscomb (1964); 
it has been championed for Bayesian analysis by Geweke (1989). Suppose 
we wish to approximate a posterior expectation, say 


E(f(6)\y) 


J f(O)L(e)ir(8)d0 
f L(0)ir(0)d0 


where for notational convenience we again suppress any dependence of the 
function of interesty and the likelihood L on the data y. Suppose we 
can roughly approximate the normalized likelihood times prior, cL(9)tt(0), 
by some density g(0 ) from which we can easily sample - say, a multi¬ 
variate t density, or perhaps a "split-r" (i.e., a t that uses possibly dif¬ 
ferent scale parameters on either side of the mode in each coordinate di- 
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rection; see Geweke, 1989, for details). Then defining the weight function 
w(9) = L{0)it{9)/g{9), we have 


E(f(6)\y) 


f f(9)w(9)g(9)d9 
f w(9)g(9)d9 

&Ef = i mM0j) 


(5.8) 


where 9j ~ g(8). Here, g (6) is called the importance function; how closely 
it resembles cL(6)n(9) controls how good the approximation in (5.8) is. To 
see this, note that if g (9) is a good approximation, the weights will all be 
roughly equal, which in turn will minimize the variance of the numerator 
and denominator (see Ripley, 1987, Exercise 5.3). If on the other hand 
is a poor approximation, many of the weights will be close to zero, and thus 
a few OjS will dominate the sums, producing an inaccurate approximation. 


Example 5.3 Suppose g(9) is taken to be the relatively light-tailed normal 
distribution, b but cL(9)ir{9) has much heavier, Cauchy-like tails. Then it 
will take many draws from g to obtain a few samples in these tails, and 
these points will have disproportionately large weights (since g will be small 
relative to cLit for these points), thus destabilizing the estimate (5.8). As 
a result, a very large N will be required to obtain an approximation of 
acceptable accuracy. ■ 


We may check the accuracy of approximation (5.8) using the following 
formula: 


Var 





where & 2 X = ^ £(xj -x) 2 , <7 2 = ^ £(% -</) 2 , and cr xy = £(xj - 
x)(iij — y). (We would of course plug in Xj = f(9j)w(9j) and yj — w(9j).) 

Finally, West (1992) recommends adaptive approximation of posterior 
densities using mixtures of multivariate t distributions. That is, after draw¬ 
ing a sample of size No from an initial importance sampling density g {) , we 
compute the weighted kernel density estimate 


No 

^( 0 ) = £>^(01 6 i , v h 2 No ). 

3 = 1 

Here, K is the density function of a multivariate t density with mode 6j 
and scale matrix Vh? No , where V is an estimate of the posterior covariance 
matrix and h ,v 0 is a kernel window width. We then iterate the procedure, 
drawing N\ importance samples from g\ and revising the mixture density 
to jf 2 ) an d so on until a suitably accurate estimate is obtained. 
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Figure 5.2 Unstandardized posterior distribution and proper rejection envelope. 


Rejection sampling 


This is an extremely general and quite common method of random gener¬ 
ation; excellent summaries are given in the books by Ripley (1987, Section 
3.2) and Devroye (1986, Section II.3). In this method, instead of trying to 
approximate the normalized posterior 


m = 


m*(Q) 

f L(0)ir(0)d0 ’ 


we try to "blanket" it. That is, suppose there exists an identifiable constant 
M > 0 and a smooth density g(0), called the envelope function such that 
L(9)tt(9) < Mg(6) for all 9 (this situation is illustrated in Figure 5.2). The 
rejection method proceeds as follows: 

(i) Generate 9j ~ g(9). 

(ii) Generate U ~ Uniform(0, 1). 

(iii) If MU g(9j ) < L(9j)tr(9j), accept Oj ; otherwise, reject 9j . 

(iv) Return to step (i) and repeat, until the desired sample Oj 

is obtained. The members of this sample will then be random variables 
from h(9). 
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A formal proof of this result is available in Devroye (1986, pp. 40-42) or 
Ripley (1987. pp. 60-62); we provide only a heuristic justification. Consider 
a fairly large sample of points generated from g(9). An appropriately scaled 
histogram of these points would have roughly the same shape as the curve 
labeled "Mg" in Figure 5.2. Now consider the histogram bar centered at the 
point labeled "a" in the figure. The rejection step in the above algorithm 
has the effect of slicing off the top portion of the bar (i.e.. the portion 
between the two curves), since only those points having - If ( -g {6j values 
below the lower curve are retained. But this is true for every potential value 
of "a" along the horizontal axis, so a histogram of the accepted 0j values 
would mimic the shape of the lower curve, which of course is proportional 
to the posterior distribution h(6), as desired. 

Intuition suggests that M should be chosen as small as possible, so as not 
to unnecessarily waste samples. This is easy to confirm, since if K denotes 
the number of iterations required to get one accepted candidate 9j then 
K is a geometric random variable, i.e., 

P(K = i) = (l-pY~ 1 p, (5.9) 

where p is the probability of acceptance. So P(K = i) decreases mono- 
tonically, and at an exponential rate. It is left as an exercise to show that 
p = c/M, where c is the normalizing constant for the posterior h(8). So 
our geometric distribution has mean E(K) p~ 1 = M/c, and thus we 
do indeed want to minimize M. Note that if h were available for selection 
as the g function, we would choose the minimal acceptable value M — c, 
obtaining an acceptance probability of 1 . 

Like an importance sampling density, the envelope density g should be 
similar to the posterior in general appearance, but with heavier tails and 
sharper infinite peaks, in order to assure that there are sufficiently many 
rejection candidates available across its entire domain. One also has to be 
careful that Mg is actually an "envelope" for the unnormalized posterior 
Lit. To see what happens if this condition is not met, suppose 


S M = {0 : L(9) tt(0) > Mg(0)} , 


the situation illustrated in Figure 5.3 with Sm = (a, b). Then the distribu¬ 
tion of the accepted Os is not h{9), but really 


h*(d) = < 


_ am 

Lc )dB + MP 9 (Sm) ’ 

M 

_ Mg(0) _ 

f sC ue)w(B)de + MP g (s M ) ' 


ees c M 
o e Sm 


(5.10) 


Unfortunately, J s ^ L(0)Tr(6)d9 > MP 3 (Sm), so even if P s (5m) is small, 
there is no guarantee that (5.10) is close to h(9). That is, only a few ob¬ 
served envelope violations do not necessarily imply a small inaccuracy in 
the posterior sample. 
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Figure 5.3 Unstandardizecl posterior distribution and deficient rejection envelope. 


As a possible solution, when we find a Oj such that L(9j )tr(6j ) > Mg(9j), 
we may do a local search in the neighborhood of Oj and increase M accord¬ 
ingly. Of course, we should really go back and recheck all of the previously 
accepted OjS. since some may no longer be acceptable with the new, larger 
M. We discuss rejection algorithms designed to eliminate envelope viola¬ 
tions following Example 5.6 below. 


Weighted bootstrap 


This method was presented by Smith and Gelfand (1992), and is very 
similar to the sampling-importance resampling algorithm of Rubin (1988). 
Suppose an M appropriate for the rejection method is not readily available, 
but that we do have a sample Oy,.. ,6 n from some approximating density 
g(9). Define 


L(6M0i) 

g{0i) 


and 



Now draw 9* from the discrete distribution over {6\...9n} which places 
mass qi at #,•. Then 


9* ~ h(0) 


m<Q) 

f L(9)w(9)d6 ’ 
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with the approximation improving as N —* oo. This is a weighted bootstrap, 
since instead of resampling from the set [d ±,... 6 ^ } with equally likely 
probabilities of selection, we are resampling some points more often than 
others due to the unequal weighting. 


To see that the method does perform as advertised, notice that for the 
standard bootstrap, 

P{0* <a) 

5Z jy J(-oo,a]( 0 «) 

1 = 1 

N-t-OO 

EgI(-oo,a\(9) = f g(9)d9 

J — OO 


so that 0* is approximately distributed as g(0). For the weighted bootstrap. 


P{0* <a) = 


N —+oo 


N 


^ oo,q](^i) — 

( — 00 , a 


i — 1 


tp H$)i r(») t 

g(9) P-c 


1 : 


jp L(e)x(9) 

^3 g(9) 

h( 0 )d 0 


N E» = l w il(-<x,a]{@i) 

it EL 

]W = f -00 m*(0)de 
”. “ /“oo m*(0)d0 


so that 0 * is now approximately distributed as h( 6 ). as desired. 

Note that, similar to the previous two indirect sampling methods, we 
need g(9) & h(0), or else a very large N will be required to obtain acceptable 
accuracy. In particular, the "tail problem" mentioned before is potentially 
even more harmful here, since if there are no candidate 0 t s located in the 
tails of h{ 6 ) there is of course no way to resample them! 

In any of the three methods discussed above, if the prior 1 ^( 6 ) is proper, 
it can play the role of g( 6 ), as shown in the following example. 

Example 5.4 Suppose Y 1 ,...,Y n ~ N(9, a 2 ) and tt( 6 ) = Cauchy(y/, r), 
with cr 2 , n, and r known. The likelihood L( 6 ) oc exp {—^2 2Ji" = 1 (j/i — 9)“) 
is maximized at 6 = y. Let M = L( 6 ) in the rejection method, and let 
g(0) = 7 t( 0 ). Then 

L(0)w{0) < Mg(0) <=> L(0) 7 t( 0 ) < L{9) ir(0) 

^=> L(9) < L{0) . 


So we simply generate 0j ~ n{9), U ~ Uniform(0, 1), and accept Oj if 
MUg(6j) < L(9j)tr(9j ), i.e., if 


U < 


m) 

m 


Clearly this ratio is also the probability of accepting a 9j candidate. Hence 
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this approach will be quite inefficient unless ir(0) is not too flat relative 
to the likelihood L{ 9), so that most of the candidates have a reasonable 
chance of being accepted. Unfortunately, this will not normally be the case, 
since in most applications the data will carry much more information about 
6 than the prior. In these instances, setting g = i will also result in poor 
performance by importance sampling or the weighted bootstrap, since g 
will be a poor approximation to Lit . As a result, g should be chosen as the 
prior 7T as a last resort, when methods of finding a better approximation 
have failed. ■ 


5.4 Markov chain Monte Carlo methods 


5.4.1 Substitution sampling and data augmentation 


Importance sampling, rejection sampling, and the weighted bootstrap are 
all "one-off," or noniterative, methods: they draw a sample of size N, and 
stop. Hence there is no notion of the algorithm "converging" - we sim¬ 
ply require N sufficiently large. But for many problems, especially high¬ 
dimensional ones, it may be quite difficult or impossible to find an im¬ 
portance sampling density (or envelope function) which is an acceptably 
accurate approximation to the log posterior, but still easy to sample from. 

Fortunately, some problems of this type may be fairly easy to cast in a 
substitution framework. For example, consider the three-stage hierarchical 
model, where, using p as a generic symbol for a density function, we have a 
likelihood p(y\B), a prior p(6\rj), and a hyperprior p{ij). (Any or all of v, 9, 
and r] may be vectors, but for convenience we suppress this in the notation.) 
The most common example of a model of this type would be the normal 
means model, where y\0 N(y\9, 1), 9\r, ~ N(0\ V , 1), and v ~ N(r) |0,1). 

Assuming the prior is conjugate with the likelihood, we can easily compute 
the marginal distribution, 


P(y\v) = j P(yW)p(9\ri)d9 

so that the posterior distribution is given in closed form as 


p(%> P) = 


Also notice that 

p(r]\9,y) = 


p{Q,y,p) _ p(y\Q)p(0\p) 
p{y,p) p(ylp) 


p{y,8,y) 

p(Q,y) 

p{Q\p)p(p) 

p(0) 


p{0\p)p(p) 

J p(0\p)p(p)dp 

= p(v\0) ■ 


Note that this distribution will also be available in closed-form given the 
conjugacy of the hyperprior with the prior. 
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For inference on 9, we seek the marginal posterior, 

p(0\y) = I p(Q\y,v)p(v\y)dv , 

and we also have 

P(v\y) = j p(ri\9)p(9\y)d0 . 

Switching to more generic notation, we see we have a system of two linear 
integral equations of the form 

p{x) = f p(x\y)p(y)dy 
p(y) = f p(y\x)p{x)dx 1 

where p(x Iv) and p(y\x )are known, and we seek p(x). To solve this system, 
we can substitute the second equation into the first and then switch the 
order of integration, obtaining 


p{x) - J p(x\y) I p(y\x')p(x')dx'dy 
= J h(x, x')p(x')dx' , 


where h(x,x') = f p(x\y)p(y\x')dy Equation (5.12) is sometimes called a 
fixed point system, since if the true marginal density pipe) is inserted on the 
right hand side, we get p(x) out again on the left x ' is of course just a 
dummy argument). But what happens if we put in nearly p(x)‘l That is, 
suppose po(x) is some starting estimate of p(x). We calculate 


pi(x) = j h(x, x')p 0 (x')dx' = I h (p 0 (x)) , 

where l h denotes the integral operator indexed by h. Defining p i+l (x) = 
Ih(Pi{x)), i = 0, 1, . . ., does this iterative process converge? 

Theorem 5.2 Provided the joint density p(x, y) > 0 over the domain 
Wx_y, 

(a) p(x ), the true marginal density, is the unique solution to the system 

(5.11). 

(b) The sequence { p; (x )} converges monotonically in L\ to p(x) 

( c ) / | Pi{x) — p(x)| dx —> 0 exponentially in i. 

These results are proved by Tanner and Wong (1987) in the context of 
missing data problems. 

Thus we have an iterative process for obtaining the true marginal den¬ 
sity p(x). But of course, the process involves performing the integration 
in equation (5.12) analytically, which in general will not be possible (if 
it were, we would never have begun this line of reasoning in the first 
place!). Thus instead of doing the substitution analytically, suppose we 
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use a sampling-based approach, wherein we draw A' 0 ) ~ po(x), and then 
draw yC 1 ) ~ p(y\x i '°l). Note that the marginal distribution of yW is then 
Pi(y) = J p(y\x)po{x)dx. Next draw A^ 1 ) ~ p(x\y (1) ), so that A (1) has 
marginal distribution 


Pi(«) 


J p(x\y)pi(y)dy 

f h(x, x'jpo^x^dx' 


h(Po(x)) ■ 


By property (b) of Theorem 5.2, repeating this process produces pairs 
(A^, y(0) such that 


X ~ p(x ) and Y ~ p(y ) . 

That is, for i sufficiently large, A' (0 may be thought of as a sample from 
the true marginal density p(x). Notice that the luxury of avoiding the in¬ 
tegration in (5.12) has come at the price of obtaining not the marginal 
density p(x ) itself, but only a sample from this density. Also notice that, 
unlike the methods in Subsection 5.3, this algorithm is iterative, requiring 
sampling until "convergence" has obtained. Approaches of this type are 
commonly known as Markov chain Monle Carlo (or simply MCMC) meth¬ 
ods, since their output is produced by generating samples from a certain 
Markov chain. 

The algorithm outlined above is essentially the "data augmentation" al¬ 
gorithm of Tanner and Wong (1987), developed for model settings where X 
plays the role of the (possibly vector) parameter of interest and Y plays the 
role of a vector of missing data values. However, in order to obtain an esti¬ 
mate of the marginal density p(x ) (instead of merely a single sample from 
it), these authors complicate the algorithm slightly. At the i th stage of the 

algorithm, they generate y/‘\ ..., yW ~ pfj/lar^- 1 )), and then generate a 
single X value as 


, m 

x(%) - pi(x) = — . 

771 j = 1 

Here, pi(x) is a Monte Carlo estimate of p,(x) = / p(x\y)p,(y)dy. This 
produces a less variable X W sample at each iteration, but, more important 
it automatically produces a smooth estimate of the marginal density of 
interest p(x) at convergence of the algorithm. 

Tanner (1993) observes that m = 1 is sufficient for convergence of the 
algorithm, and refers to this approach as "chained data augmentation." In 
a subsequent paper, Gelfand and Smith (1990) instead recommended par¬ 
allel sampling, as illustrated in Figure 5.4. The initial values A'|°',.... Xm 1 
are sampled from some starting distribution po(x), or perhaps are chosen 
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Figure 5.4 Diagram of parallel substitution sampling. 


deterministically (say, A'j 0) = a for all j, where a is a naive estimate of the 
marginal mean of X, such as the prior mean). The advantage of this ap¬ 
proach is that at convergence it produces m replicates that are marginally 
ifidepetidetit, and thus a presumably better density estimate pi(x). An ob¬ 
vious disadvantage, however, is the wastefulness of the algorithm, which 
discards all but m of the 2 im random variates it generates. 

As an alternative, many authors (notably Geyer, 1992) have advocated 
the use of ergodic sampling. Here, we generate only one of the columns of 
variates illustrated in Figure 5.4, but continue sampling for an additional 
m - 1 iterations after convergence at iteration i. Our density estimate is 
then given by 

1 m 

Pi{x) = — Vp(tE|t/ (!+J-1) ) . (5.13) 

m ' 

3 -1 

That is, we use consecutive variates from a single long sampling chain. 
While these variates will not be independent, they will all come from the 
proper marginal distribution (the ergodic distribution of the chain), and 
hence the density estimate in (5.13) will indeed be a good approximation 
to p(x) for i and m sufficiently large. However, the approximation will 
be poor if X and Y are highly correlated, since this will lead to high 
autocorrelation ("slow mixing") in the resulting sequence. That is, the 
chain might get "stuck" in one part of the joint distribution, leading to 
a density estimate based on samples that represent only a portion of the 
distribution. To remedy this, some authors (notably Raftery and Lewis, 
1992) have recommended retaining only every k th iterate after convergence 
has obtained, resulting in the revised ergodic density estimate 

. m 

Pi{x) - — . 

TTl 

3 = 1 

Clearly in all of the above formulae, the proper selection of i, m , and per¬ 
haps k is critical to the success of the approach. In the interest of continuing 
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with our presentation of the various MCMC methods, we defer further dis¬ 
cussion of this issue to Subsection 5.4.6. 


5.4.2 Gibbs sampling 

While the algorithm of the previous subsection could conceivably be used in 
problems of dimension higher than two by thinking of X and Y as vectors, 
such an approach may not be feasible due to the difficulty in generating 
from complex multivariate distributions. As a result, we require a A"-variate 
extension of the substitution sampling approach. For example, consider the 
case K = 3. Certainly the system of two equations in two unknowns given 


in (5.11) could be extended to 



p(x) = 

f p(x. z\y)p(y)dydz 


p(y) = 

f p(y, x\z)p(z)dxdz , 

(5.14) 

p(z) = 

fp{z, y\x)p(x)dxdy 



leading to a fixed point equation of the form p(x) — J h(x, x")p(x" ), similar 
to (5.12). But a sampling-based version of this algorithm would require 
sampling from the three bivariate conditional distributions in (5.14), which 
are unlikely to be available in closed form. Instead, we might write the 
system solely in terms of univariate distributions as 

p(x) - J p(x\z,y)p(z\y)p(y)dydz 

p(y ) = f p(y\x,z)p(x\z)p(z)dxdz . (5.15) 

P( z ) = f P(z\y,x)p(y\x)p(x)dxdy 

A sampling implementation could now proceed using the 6 univariate con¬ 
ditional distributions in (5.15), assuming that all 6 were available. Note 
that three of these distributions are not fully conditional, but require some 
marginalization beforehand - perhaps carried out by subsampling algo¬ 
rithms. 

Now consider the general A'-dimensional problem, where we have a col¬ 
lection of K random variables which we denote by U = (U\,..., Uk)- 
A univariate substitution approach would require sampling from a chain 
of K(K - 1) distributions, which is likely to be impractical for large K. 
Fortunately, in many situations a much simpler alternative is provided 
by an algorithm which has come to be known as the Gibbs sampler. In 
this method, we suppose the frill or complete conditional distributions 
{Pi{Ui\Ujti\ i = \..... K\ are available for sampling. (Here, "available" 
means that samples may be generated by some method, given values of the 
appropriate conditioning random variables.) Under mild conditions (see 
Besag, 1974), these complete conditional distributions uniquely determine 
the full joint distribution, p(U i,..., Uk), and hence all marginal distribu¬ 
tions p(Ui ), / = K. Then, given an arbitrary set of starting values 
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{f/}° 5 , ..., Uk'}, the algorithm proceeds as follows: 

DrawC^ ~ Pl (lh |/7} 0) ,..., Up). 

Draw uP ~ P2(U 2 \U[ 1] , Up, ..., Up), 


Drawee 1 ’ ~ pk(U k \u[ 1] , ..., U^U), 
completing one iteration of the scheme. After t such iterations we would 
then obtain (Up,..., Up). 

Theorem 5.3 For the Gibbs sampling algorithm outlined above. 

(a) (Up, .. . Up) 4 (Ux ■ ■ ■ U K ) ~ p{Ui ■ ■ ■ U K ) as t - oo. 

(b) The convergence in part (a) is exponential in t using the L\ norm. 

These results were originally proved for discrete distributions by Geman 
and Geman (1984) in the context of image restoration using Gibbs dis¬ 
tributions (hence, the algorithm's somewhat confusing name). Schervish 
and Carlin (1992) provided conditions for convergence using the L 2 norm; 
Roberts and Smith (1993) provide conditions that are even more general 
and simpler to verify. 

Thus, all we require to obtain samples from the joint distribution of 
(Ui,... Uk) is the ability to sample from the Ii corresponding full con¬ 
ditional distributions. In the context of Bayesian analysis, the joint dis¬ 
tribution of interest is the joint posterior distribution, p{U\,..., Uk |y ), 
or perhaps one or more of the marginal posterior distributions, p(Ui |y). 
The Gibbs sampler then requires draws from the full conditional posterior 
distributions p{Ui\Uj^i,y). Fortunately, the conditioning inherent in these 
distributions implies that for a very large class of hierarchical models with 
conjugate priors and hyperpriors, the necessary sampling distributions will 
be available in closed form (see Examples 5.5 and 5.6 below). Furthermore, 
marginal density estimates will be available in a manner similar to that 
described in the previous subsection. For instance, using a few (in = 3 or 
5) parallel sampling chains run for a total of T iterations each, if an accept¬ 
able degree of convergence had obtained by iteration to, then a marginal 
posterior density estimate for Ui would be given by 


P(Ui |y) = 


1 


m(T -1 0 ) 


E E 

j=i <=<0+1 


(0 


//(*) {/(*) . 


• u kj .y) 


(5.16) 

This formula makes use of the fact that at convergence, by part (a) of Theo¬ 
rem 5.3 the (AT-l)-tuples (Up ,..., j , j , •. - Upj) are distributed 
according to the joint posterior distribution of (JJ\ .,Ui- 1, t/j+i,..., U K ). 
We remark that the smoothing inherent in formula (5.16) leads to a less 


© 2000 by CRC Press LLC 



variable estimate of p(Ui |y) than could be obtained by kernel density esti¬ 
mation, which would ignore the known shape of U{ ' s full conditional distri¬ 
bution. Gelfand and Smith (1990) refer to this process as "Rao-Blackwell- 
ization," since the conditioning and resulting variance reduction is reminis¬ 
cent of the well-known Rao-Blackwell Theorem. It does require knowledge 
of the normalized form of the full conditional for U {; if we lack this, we 
can resort to an importance-weighted marginal density estimate (see Chen, 
1994) or simply a kernel density estimate (Silverman, 1986, Chapter 3). 


Example 5.5 As a first, elementary illustration of the Gibbs sampler we 
reconsider the Poisson/gamma model of Example 2.5, as expanded in Chap¬ 
ter 3, Exercise 9. That is. 


Yi\0i ~ d PoissonidiU), 0 t G{a, /?), * = 1,..Jb, 

where the are known constants. For the moment we assume that a is also 
known, but that 0 has an IG(c, d) hyperprior, with known hyperparameters 
c and d. Thus the density functions corresponding to the three stages of 
the model are 


g(0i\P) 

and h(3) 




Vi'- 

0 ? _ 1 e - 9i /0 


Vi > 0 , $i> 0 , 


T(a)0 a 

e -i K0d) 

T(c)d e 3 c+l 


, a > 0, /3 > 0, 
, c > 0, d > 0. 


Our main interest lies in the marginal posterior distributions of the 0;, 
p(6i |y). Note that while the gamma prior is conjugate with the Poisson 
likelihood and the inverse gamma hyperprior is conjugate with the gamma 
prior, no closed form for p(0,|y) is available. However, the full conditional 
distributions of ft and the 6{ needed to implement the Gibbs sampler are 
readily available. To see this, recall that by Bayes' Rule, each is proportional 
to the complete Bayesian model specification. 


k 

j =i 


m 


(5.17) 


Thus we can find the necessary full conditional distributions by dropping 
irrelevant terms from (5.17) and normalizing. For 6{ we have 


P(0i|0;?si./?.y) 


°c f(yi\Oi)g(0i\P) 

a ^•+«- 1 e -9.(t.+l/^) 

0(1 G (0i | yi + a , (ti + 1/3) x ) , 


(5.18) 
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while for /? we have 


p{P\{0i], y) oc 


OC 


oc 


OC 


urn?) 

e -e,/p 


i —1 
' k 

n 


/?<* 


M/?) 

e -l/(0d) 


pc+1 


pka+c+ 1 

7G J /? I fca + c, | ^ + 1/d 


c*=i 


(5.19) 


Thus our conditionally conjugate prior specification (stage 2 with stage 1, 
and stage 3 with stage 2) has allowed both of these distributions to emerge 
as standard forms, and the {9^ ’} and /? 0 ) may be sampled directly from 
(5.18) and (5.19), respectively. 

As a specific illustration, we apply this approach to the pump data in 
Table 3.5, previously analyzed in Chapter 3, Exercise 10. These are the 
numbers of pump failures, Y t , observed in t t thousands of hours for k = 10 
different systems of a certain nuclear power plant. We complicate matters 
somewhat by also treating a as an unknown parameter to be estimated. 
The complication is that its full conditional, 


p{a\{9i},^,y) oc 


' k 

.i= 1 


h(a) , 


is not proportional to any standard family for any choice of hyperprior 
h(a). However, if we select an Expo(fi) hyperprior for q, then 


p(a\{9t},/3, y) oc 


n 


er 1 

T(a)0 a 


Ot/li 


which can be shown to be log-concave in Thus we may use the adaptive 
rejection sampling (ARS) algorithm of Gilks and Wild (1992) to update 
this parameter. 

The WinBUGS code to fit this model is given in Figure 5.5. We chose 
the values fi — 1 , c — 0 . 1 , and cl = 1 . 0 , resulting in reasonably vague 
hyperpriors for a and 0 . WinBUGS recognizes the conjugate forms (5.18) and 
(5.19) and samples directly from them; it also recognizes the log-concavity 
for a and uses ARS for this parameter. 

The results from running 1000 burn-in samples, followed by a "produc¬ 
tion" run of 10,000 samples (single chain) are given in Table 5.2. Note that 
while 9 5 and 9e, have very similar posterior means, the latter posterior is 
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i 

Yi 

U 

n 

i 

5 

94.320 

.053 

2 

1 

15.720 

.064 

3 

5 

62.880 

.080 

4 

14 

125.760 

.111 

5 

3 

5.240 

.573 

6 

19 

31.440 

.604 

7 

1 

1.048 

.954 

8 

1 

1.048 

.954 

9 

4 

2.096 

1.910 

10 

22 

10.480 

2.099 


Table 5.1 Pump Failure Data (Gayer and O'Muircheartaigh, 1987) 


model 

{ 

for (i in 1:k) { 

theta[i] * dgamma(alpha,beta); 
lambda[i] <- theta[i]*t[i]; 

Y[i] " dpois(lambda[i]); 

} 

alpha ' dexp(l.O); 
beta * dgamma(0.1, 1.0); 

> 

# DATA: 

list(k = 10, Y = c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22), 
t = c(94.320, 15.72, 62.88, 125.76, 5.24, 31.44, 1.048, 
1.048, 2.096, 10.48)) 

# INITIAL VALUES: 

list(theta=c(l,1,1,1,1,1,1,1,1,1), alpha=l, beta=l) 

Figure 5.5 WinBUGS code for solving the pump data problem. 


much narrower (i.e., smaller posterior standard deviation). This is because, 
while the crude failure rates for the two pumps are similar, the latter is 
based on a far greater number of hours of observation j f . = 31.44, while 
1 5 = 5.24). Hence we "know" more about pump 6, and this is properly 
reflected in its posterior distribution. ■ 
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parameter 

mean 

sd 

2.5% 

median 

97.5% 

a 

0.7001 

0.2699 

0.2851 

0.6634 

1.338 

p 

0.9290 

0.5325 

0.1938 

0.8315 

2.205 


0.0598 

0.02542 

0.02128 

0.05627 

0.1195 

05 

0.6056 

0.3150 

0.1529 

0.5529 

1.359 

06 

0.6105 

0.1393 

0.3668 

0.5996 

0.9096 

010 

1.993 

0.4251 

1.264 

1.958 

2.916 


Table 5.2 Posterior summaries, pump data problem 


Example 5.6 Our second Gibbs sampling example concerns a longitudinal 
dataset on the weights of young laboratory rats, given in Table 5.3 and 
originally considered by Gelfand, Hills, Racine-Poop, and Smith (1990). 
While this dataset is already something of a "golden oldie" in the MCMC 
Bayesian literature, we include it since it is a problem of moderate difficulty 
that exploits the sampler in the context of the normal linear model, common 
in practice. In these data, Y z ,• corresponds to the weight of the i th rat at 
measurement point j, while jpj denotes its age in days at this point. We 
adopt the model 

Yij ~ d N (on + fcxij , <r 2 ) , i = 1,.,., k, j = 1,..., rtj , (5.20) 

where k = 30 and n, = 5 for all i (unlike humans, laboratory rats can 
be counted upon to show up for all of their office visits!). Thus the model 
supposes a straight-line growth curve with homogeneous errors, but allows 
for the possibility of a different slope and intercept for each individual 
rat. Still, we believe all the growth curves to be similar, which we express 
statistically by assuming that the slope-intercept vectors are drawn from a 
common normal population, i.e., 

8 ‘ h C;)~"( 8os &)' e )' i=i .* (5 ' 2i) 

Thus we have what a classical statistician might refer to as a random effects 
model (or a random coefficient model). 

To effect a Bayesian analysis, we must complete the hierarchical structure 
at the second stage with a prior distribution for <r 2 and at the third stage 
with prior distributions for 8q and S. We do this by choosing conjugate 
forms for each of these distributions, namely, 

a* ~ IG{a, b ) , 
do ~ .V(tj. C) , and 

5T 1 ~ W{(pR)-\ p) , (5.22) 
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Rat Weight measurements 


i 

Yn 

Yi 2 

Y i3 

Yi 4 

Yi 5 

i 

151 

199 

246 

283 

320 

2 

145 

199 

249 

293 

354 

3 

147 

214 

263 

312 

328 

4 

155 

200 

237 

272 

297 

5 

135 

188 

230 

280 

323 

6 

159 

210 

252 

298 

331 

7 

141 

189 

231 

275 

305 

8 

159 

201 

248 

297 

338 

9 

177 

236 

285 

340 

376 

10 

134 

182 

220 

260 

296 

11 

160 

208 

261 

313 

352 

12 

143 

188 

220 

273 

314 

13 

154 

200 

244 

289 

325 

14 

171 

221 

270 

326 

358 

15 

163 

216 

242 

281 

312 

16 

160 

207 

248 

288 

324 

17 

142 

187 

234 

280 

316 

18 

156 

203 

243 

283 

317 

19 

157 

212 

259 

307 

336 

20 

152 

203 

246 

286 

321 

21 

154 

205 

253 

298 

334 

22 

139 

190 

225 

267 

302 

23 

146 

191 

229 

272 

302 

24 

157 

211 

250 

285 

323 

25 

132 

185 

237 

286 

331 

26 

160 

207 

257 

303 

345 

27 

169 

216 

261 

295 

333 

28 

157 

205 

248 

289 

316 

29 

137 

180 

219 

258 

291 

30 

153 

200 

244 

286 

324 


Table 5.3 Rat population growth data (Gelfand et al, 1990, Table 3). 


where W denotes the Wishart distribution, a multivariate generalization of 
the gamma distribution. Here, R is a 2 x 2 matrix and p > 2 is a scalar 
"degrees of freedom" parameter. In more generic notation, a p x p symmetric 
and positive definite matrix V distributed as W(D, n ) has density function 
proportional to 


| y |(„_ p _ 1 )/2 

6XP 


- l -tr(D- l V) 


(5.23) 
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provided that n > p. For this distribution, one can show E(Vij) = nDij, 
Var(Vij) = n(Djj + DuDjj), and Cov(Vij, V k i) - n(D ik Djj + DuD jk ). 
Thus under the parametrization chosen in (5.22) we have E(T, 1 ) = R 1 , 
so R~ l is the expected prior precision of the #;s (i.e., R is approximately 
the expected prior variance). Also, Far(Ejj) is decreasing in p, so small p 
values correspond to vaguer prior distributions. For more on the Wishart 
distribution, the reader is referred to Press (1982, Chapter 5). 

The hyperparameters in our model are a, b.rj, C, p, and R, all of which 
are assumed known. We seek the marginal posterior for do given only the 
observed data. The total number of unknown parameters in our model is 66: 
30 ctjS, 30 PiS, ao, Po, c 2 , and the 3 unique components of E. This number 
is far too high for any approximation or numerical quadrature method, 
and the noniterative Monte Carlo methods in Section 5.3 would be very 
cumbersome. By contrast, the Gibbs sampler is relatively straightforward 
to implement here thanks to the conjugacy of the prior distributions at 
each stage in the hierarchy. For example, in finding the full conditional 
for 9i we are allowed to think of the remaining parameters as fixed. But 
as proved by Lindley and Smith (1972) and shown in Example 2.7, with 
variance parameters known the posterior distribution for 0, emerges as 

di\y, d 0 , £-\ <r 2 ~ N {Di{<T- 2 Xj yi + E -1 #o) . A) 
for i = 1,..., k, where D~ x = Xi + E -1 , 


/ Vn \ 


( 1 xn \ 

; ' 
\ yin , j 

, and Xi = 

\ 1 x in, / 


Similarly, the full conditional for 8o ' s 

0o|y, {0O.S- 1 , tri-NiVikX-'e + C-'ri), V) 


where V = (fcE -1 + C 1 and 0 = \ 

For v-t under the parametrization in (5.22), the prior distribution 
(5.23) is proportional to 


s _l|(p-3)/2 


exp 


-\tr{pRY,- x ) 


Combining the normal likelihood for the random effects (5.21) with this 
prior distribution produces the updated Wishart distribution 


E- 1 y, {0;}, 0„, <r 2 


W 


E?=i(«<- fl o)(0.--0o) T + ^]' 



Finally, the full conditional for a 2 is the updated inverse gamma distribu- 
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tion 


C 2 | y, {#;}, Oo, £ 
~ JG^f+a, 


lE‘=i(y i 


*.-0,-) T (yi - *<0.0 + 6 -1 



__ 

where n = 2^i=i n *- Thus the conditioning property of the Gibbs sampler 
enables closed form full conditional distributions for each parameter in 
the model, including the variance parameters - the sticking point in the 
original Lindley and Smith paper. Samples from all of these conditional 
distributions (except perhaps the Wishart) are easily obtained from any of 
a variety of statistical software packages. Wishart random deviates may be 
produced via an appropriate combination of normal and gamma random 
deviates. The reader is referred to the original paper on this topic by Odell 
and Feiveson (1966), or to the more specifically Bayesian treatment by 
Gelfand et al. (1990). 

In our dataset, each rat was weighed once a week for five consecutive 
weeks. In fact, the rats were all the same age at each weighing: xn = 8, 
Xi 2 = 15, Xiz = 22, xn = 29, and x^ = 36 for all i. As a result, we may 
simplify our computations by rewriting the likelihood (5.20) as 


Yij N (a,: + Pi(xij - x) , a' 2 ) 


i = j = !,•••,«« . 


so that it is now reasonable to think of a, and /?, as independent a priori. 
Thus we may set £ = Diag(a 2 , cr|), and replace the Wishart prior (5.22) 
with a product of independent inverse gamma priors, say IG(a a , b a ) and 
IG(a 0 ,b 0 ). 

We complete the prior specification by choosing hyperprior values that 
determine very vague priors, namely, C~ 1 = 0 (so that tj disappears entirely 
from the full conditionals), a = a a = a 0 = e, and b = b a = b 0 = 1/e where 
e = 0.001. Using the BUGS language (Spiegelhalter et al., 1995a), we ran 
three independent Gibbs sampling chains for 500 iterations each. These 
chains were started from three different points in the sample space that we 
believed to be overdispersed with respect to the posterior distribution, so 
that overlap among these chains would suggest convergence of the sampler. 
(We hasten to remind the reader that diagnosing convergence of MCMC 
algorithms is a very difficult problem, and is the subject of much further 
discussion in Subsection 5.4.6 below.) 

We summarized the Gibbs samples generated by BUGS using CODA (Best, 
Cowles, and Vines, 1995), a menu-driven collection of S+ routines useful 
for convergence diagnosis and output analysis. (See the Appendix Subsec¬ 
tion C.3.3 MCMC software guide for more information on the acquisition 
and use of BUGS and CODA.) The first column of Figure 5.6 displays the 
Gibbs samples obtained for the population slope and intercept at time zero 
(birth), the latter now being given by /i = no — dox due to our recentering of 
the Xij s. As a satisfactory degree of convergence seems to have occurred by 
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betaO 


Figure 5.6 Monitoring plots and kernel density estimates for p and 3a , Gibbs 
sampling analysis of the rat data. 


iteration 100, we use the output of all three chains over iterations 101-500 
to obtain the posterior density estimates shown in the second column of the 
figure. These are only kernel density estimates; more accurate results could 
be obtained via the Rao-Blackwellization approach in equation (5.16). It 
appears that the average rat weighs about 106 grams at birth, and gains 
about 6.2 grams per day. 

Gelfand et al. (1990) also discuss and illustrate the impact of missing data 
on the analysis. If Yu is missing, samples from its predictive distribution 
PiVij |y) are eas ily generated, since its complete conditional distribution 
is nothing but the normal distribution given in the likelihood expression 
(5.20). The predictive could then be estimated using a kernel smooth of 
these samples as in Figure 5.6, or a mixture density estimate of the form 
(5.16), namely, 

fKmly) = J <r 2 |y) 

9 = 1 
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where the superscript ( g ) indexes the post-convergence replications from 
the sampler. ■ 

In the previous example, the prior distribution of each parameter was 
chosen to be conjugate with the corresponding likelihood term that pre¬ 
ceded it in the hierarchy. This enabled all of the full conditional distribu¬ 
tions necessary for implementing the Gibbs sampler to emerge in closed 
form as members of familiar distributional families, so that sample genera¬ 
tion was straightforward. But in many examples, we may want to select a 
prior that is not conjugate; in others (e.g., nonlinear growth curve models), 
no conjugate prior will exist. How are we to perform the required sampling 
in such cases? 

Provided that the model specification (likelihood and priors) are available 
in closed form, any given full conditional will be available at least up to a 
constant of proportionality, since 

P(0i|{fy*i}.y) °c f(y\0)n(9) = L(9)ir(d) . (5.24) 

Hence the "nonconjugacy problem" for Gibbs sampling outlined in the 
previous paragraph is nothing but a univariate version of the basic compu¬ 
tation problem of sampling from nonstandardized densities. This suggests 
using one of the indirect methods given in Subsection 5.3.2 to sample from 
these full conditionals. Unfortunately, the fact that this problem is embed¬ 
ded within a Gibbs sampler makes some of these techniques prohibitively 
expensive to implement. For example, recall that the weighted bootstrap 
operates by resampling from a finite sample from an approximation to the 
target distribution with appropriately determined unequal probabilities of 
selection. Sampling from (5.24) at the t th iteration of the Gibbs algorithm 
would thus require drawing a large approximating sample, of which only 
one would be retained. Worse yet, this initial sample could not typically be 
reused, since the values of the conditioning random variables {dj^ti} will 
have changed by iteration (t + 1). 

Rejection sampling will have similarly high start-up costs when applied 
in this context, though several ingenious modifications to reduce them have 
been proposed in the MCMC Bayesian literature. Wakefield et al. (1991) 
developed a generalized version of the ratio-of-uniforms method (see e.g. 
Devroye, 1986, p.194), a tailored rejection method that is particularly useful 
for sampling from nonstandardized densities with heavy tails. Gilks and 
Wild (1992) instead use traditional rejection, but restrict attention to log- 
concave densities (i.e., ones for which the second derivative of the log of 
(5.24) with respect to 0, is nonpositive everywhere). For these densities, an 
envelope on the log scale can be formed by two lines, one of which is tangent 
to log p left of the mode, and the other tangent right of the mode. The 
exponential of this piecewise linear function is thus guaranteed to envelope 
exp(log p) = p. In a subsequent paper, Gilks (1992) developed a derivative- 
free version of this algorithm based on secant (instead of tangent) lines. 
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The log-concavity property can be a nuisance to check, but is typically not 
a serious restriction (though the t family is a notable exception). 

In an attempt to avoid having to "re-envelope" the target density at each 
iteration, Carlin and Gelfand (1991b) recommended a multivariate rejec¬ 
tion algorithm. In this approach, a single multivariate split-normal or split-f 
envelope function is obtained for L(9)tt(9) at the algorithm's outset. Cross- 
sectional slices of this envelope in the appropriate coordinate direction may 
then serve as univariate envelopes for the full conditional distributions re¬ 
gardless of the values of the conditioning random variables at a particular 
iteration. Unfortunately, while this approach greatly reduces the amount of 
computational overhead in the algorithm, the difficulty in finding an enve¬ 
lope whose shape even vaguely resembles that of L(9)ir(9) across 9 limits 
its applicability to fairly regular surfaces of moderate dimension. 

Yet another alternative is provided by Ritter and Tanner (1992), who, 
instead of rejection, recommend approximate inversion using a grid-based 
estimate of full conditional cdf, F(9i\{6j^i} y). Since F~ l (U) will have 
distribution F when U is distributed Uniform(0, 1), F ^ 1 ( U ) should have 
distribution Wroximately equal to F, with the quality of the approxima¬ 
tion improving as the mesh of the grid goes to zero. Unfortunately, the 
establishment of F is very numerically intensive, involving a large number 
of function evaluations, and the quality of the algorithm's output is difficult 
to assess. 

In summary, all of these solutions seem rather tedious to program, in¬ 
volve a large amount of computational overhead, and are of questionable 
accuracy (the rejection methods due to possible envelope violations; the 
inversion method due to the piecewise linear nature of the approximate 
cdf). Perhaps when we encounter a substantial number of unstandardized 
full conditionals, the problem is trying to tell us something, and we should 
look for an alternate Markov chain Monte Carlo method. The next subsec¬ 
tion investigates such a method. Subsection 5.4.4 goes on to show how the 
two approaches may be combined to form what is possibly the best (and 
certainly the easiest) solution to the nonconjugacy problem. 

5.4.3 Metropolis-Hastings algorithm 

Like the Gibbs sampler, the Metropolis-Hastings algorithm is an MCMC 
method that has been widely used in the mathematical physics and im¬ 
age restoration communities for several years, but only recently discovered 
by statisticians. The method in its simplest form dates to the paper by 
Metropolis et al. (1953), which we now summarize briefly. Suppose the 
true joint posterior for a (possibly vector-valued) parameter U has den¬ 
sity p(u) with respect to some dominating measure p. Choose an auxiliary 
function q(y, u) such that q(., u) is a pdf with respect top for all u, and 
q{u, v) = q(v, u) for all u, v (that is, q is symmetric in its arguments) 
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The function q is often called the candidate or proposal density. We then 
generate a Markov chain as follows: 

1. Draw v ~ q(-, u), where u = U d -1 ), the current state of the Markov 
chain; 

2. Compute the ratio r = p(v)/p(u); 

3. If r > 1, set. [/W = v; 


If r < 1, set, U (t > - 


v with probability r 
u with probability 1 — r 


Notice that the joint posterior density p is needed only up to proportionality 
constant, since it is only used to compute the acceptance ratio in step 2. 
But, as already noted, in Bayesian applications p(u) oc L(u)tt(u), a form 
that is typically available. 


Theorem 5.4 For the Metropolis algorithm outlined above, under mild 
conditions, U (t) -1 U ~ p as t —* oo. 

The general proof of this result using the theory of irreducible Markov 
chains (Nummelin, 1984) is given by Tierney (1994). For finite state spaces, 
Ripley (1987, p. 113) shows why we need symmetry in q. In this case, the 
transition kernel can be represented by a matrix, say P, wherein Py- gives 
the probability of moving from state i to state j. Our Markov chain will 
have equilibrium distribution d — (di,..., dk)' if and only if 

d'P = d' . 


Since our candidate transition matrix Q is symmetric, we have 
d,:Pij - di min ^1, Qij 

= min(d, : , dj)Q,j 
= min(di, dj)Qji 
— d P 

That is, the chain is reversible. But this in turn implies that 

k k 

(d'P),- = = £>P,-t 

i = 1 i = 1 

k 

= ^ Pji — dj. 

Thus our Metropolis chain does have the proper equilibrium distribution 

d. 

To show convergence to and uniqueness of the equilibrium distribution, 
we need a bit more machinery. The equilibrium distribution is unique if P 
is irreducible', that is, if for every set A with p( A) > 0, the probability of 
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the chain ever entering A is positive for every starting point u. (Note that 
P is irreducible if and only if Q is.) The distribution of [/(*) will converge 
to the equilibrium distribution if and only if P is aperiodic. That is, the 
chain can move from any state to any other; there can be no "absorbing" 
states (or collection of states) from which escape is impossible. 

For the continuous parameter settings we shall encounter, the most con¬ 
venient choice for q is a E) density, since it is easily sampled and 

clearly symmetric in v and 8 f '~ ] While in theory any positive definite 
covariance matrix E should suffice, in practice its selection requires a bit 
of care. Thinking in one dimension for the moment notice that choosing 
too large a value for E will cause the candidate density to take large jumps 
around the parameter space - many to places far from the support of the 
posterior. Thus many of these candidates will be rejected, and the chain 
will have a tendency to "get stuck" in one state and not traverse the sample 
space well. On the other hand, choosing an overly small E can be harmful 
as well, since then the chain will have a high acceptance rate, but wind 
up "baby-stepping" around the parameter space. This will again increase 
the autocorrelation in the chain and lengthen the time required to visit the 
entire space. 

Muller (1991) suggested matching E as nearly as possible to the true 
posterior covariance matrix, perhaps by obtaining a crude estimate of E 
from an initial sampling run. Other authors have recommended choosing E 
to provide an observed acceptance ratio near 50%. A recent investigation by 
Gelman, Roberts, and Gilks (1996) shows that this second piece of advice is 
close to optimal when the target density is univariate normal: the first order 
autocorrelation in the chain is minimized by taking \/s equal to 2.4 times 
the true posterior standard deviation, and this candidate density leads to 
a 44.1% acceptance rate. However, matters are a bit different in higher 
dimensions, as these authors go on to show using the theory of Langevin 
diffusion processes. For the multivariate target density having product form 

K 

p{°)= 

i = 1 

for some one-dimensional density g, the optimal acceptance rate approaches 
23.4% as the dimensionality ( K ) goes to infinity. This suggests that in high 
dimensions, it is worth risking an occasional sticking point in the algorithm 
in order to gain the benefit of an occasional large jump across the parameter 
space. 

A simple but important generalization of the Metropolis algorithm was 
provided by Hastings (1970). This paper dropped the requirement that q 
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(5.25) 


be symmetric, and redefined the acceptance ratio as 


p(v)q(u, v) 
p(u)q(v, u) ' 


It is easy to show that with this small modification, the revised algorithm 
now converges to the desired target distribution for any candidate density 
q ■ For example, we might let V = u + Z where Z is distributed according 
to some density g. That is, we have q(v, u) = g(v - u); if g is symmetric 
about the origin, this reduces to the original Metropolis algorithm. Tierney 
(1994) refers to a chain produced via this approach as a random walk chain. 
Alternatively, we might employ an independence chain wherein V = Z, 
so that q(v, u) — g(v) (i.e., we completely ignore the current value of 
the chain when generating the candidate). Here, using equation (5.25) we 
obtain the acceptance ratios r = w(v)/w(u) where w(u) =p(u)/g(u), the 
usual weight function used in importance sampling. This suggests using the 
same guidelines for choosing g as for choosing a good importance sampling 
density (i.e., making g a good match for /?), but perhaps with heavier tails to 
obtain an acceptance rate closer to 0.5, as encouraged by Gelman, Roberts 
and Gilks (1996). Further guidance on the proper choice of Hastings kernel, 
as well as a derivation of the algorithm from the logic of reversibility, is 
given by Chib and Greenberg (1995a). 

Example 5.7 We illustrate the Metropolis-Hastings algorithm using the 
data in Table 5.4, which are taken from Bliss (1935). These data record the 
number of adult flour beetles killed after five hours of exposure to various 
levels of gaseous carbon disulphide (CS2). Since the variability in these 
data cannot be explained by the standard logistic regression model, we 
attempt to fit the generalized logit model suggested by Prentice (1976), 


F(death|w;) = h{w) = {exp(x')/(l +exp(x))}” !l . 


Here, w is the predictor variable (dose), and x — ( w — p )/cr where p (E ■if’ 
and a 2 , m 1 >0. Suppose there are y, flour beetles dying out of ,i, exposed 
at level it>,- i = I-- . . , k. 

For our prior distributions, we assume that mi ~ Gamma(ao, 6 q), p ~ 
Normal(co, do), and a 2 ~ Inverse Gamma(e 0 , /o), where a 0l b 0 , c 0 ,d 0 ,e 0 , 
and f 0 are known, and mi, p, and a 2 are independent. While these common 
families may appear to have been chosen to preserve some sort of conjugate 
structure, this is not the case: there is no closed form available for any of the 
three full conditional distributions needed to implement the Gibbs sampler. 
Thus we instead resort to the Metropolis algorithm. Our likelihood-prior 
specification implies the joint posterior distribution 

p(p,a 2 ,m i |y) oc f(y\p, a 2 , a 2 , mi) 


OC 
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Dosage 

Wi 

# killed 

Vi 

# exposed 

n-i 

1.6907 

6 

59 

1.7242 

13 

60 

1.7552 

18 

62 

1.7842 

28 

56 

1.8113 

52 

63 

1.8369 

53 

59 

1.8610 

61 

62 

1.8839 

60 

60 


Table 5.4 Flour beetle mortality data 


x exp 


1 ( c 0 \ 

2 { d 0 J 


m i 

bo 



We begin by making a change of variables from (//, <T", m .\) to 8 = 
(9 1 , $ 2 , 83 ) = (fi< ^ log 0,2 ! log mi). This transforms the parameter space to 
5ft 3 (necessary if we wish to work with Gaussian proposal densities), and 
also helps to symmetrize the posterior distribution. Accounting for the Ja¬ 
cobian of this transformation, our target density is now 


p(0|y) oc 


n[M“>i)] yi [l - A(u>, •)]"*' 


0=1 


exp(a o 0 3 - 2 eo 8 2 ) 


x exp 



exp(6> 3 ) 


bo 


exp(-26> 2 ) 

fo 


Numerical stability is improved by working on the log scale, i.e., by comput¬ 
ing the Metropolis acceptance ratio as r = exp[logp(v|y) — logp(0 <<-1 ^|y)]. 

We complete the prior specification by choosing the same hyperparame¬ 
ter values as in Carlin and Gelfand (1991b). That is, we take ao = .25 and 
bo = 4, so that rrij has prior mean 1 (corresponding to the standard logit 
model) and prior standard deviation 2. We then specify rather vague priors 
for fi and a 2 by setting Co = 2, d 0 = 10, e 0 = 2.000004, and fo = 1000; the 
latter two choices imply a prior mean of .001 and a prior standard deviation 
of .5 for <t 2 . Figure 5.7 shows the output from three parallel Metropolis sam¬ 
pling chains, each run for 10,000 iterations using a jV 3 ( 0 (t_1 \ £) proposal 
density where 

S = D = Diag(.mi2., .033, .10). (5.26) 

The histograms in the figure include all 24,000 samples obtained after a 
burn-in period of 2000 iterations (reasonable given the appearance of the 
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Figure 5.7 Metropolis analysis of the flour beetle mortality data using a Gaussian 
proposal density with a diagonal E matrix. Monitoring plots use three parallel 
chains, and histograms use all samples following iteration 2000. Overall Metropo¬ 
lis acceptance rate: 13.5%. 


monitoring plots). The chains mix very slowly, as can be seen from the 
extremely high lag 1 sample autocorrelations, estimated from the chain #2 
output and printed above the monitoring plots. The reason for this slow 
convergence is the high correlations among the three parameters, estimated 

as Corr(9i, Of) = -0.78, Corr(e u 0 3 ) = -0.94, and Corr(02,0 3 ) = 0.89 

from the chain #2 output. As a result, the proposal acceptance rate is low 
(13.5%) and convergence is slow, despite the considerable experimentation 
in arriving at the three variances in our proposal density. 

We can accelerate convergence by using a nondiagonal proposal covari¬ 
ance matrix designed to better mimic the posterior surface itself. From the 
output of the first algorithm, we can obtain an estimate of the posterior 
covariance matrix in the usual way as E = ^ (&j ~~ ®)(Qj — 9)', where 

j indexes Monte Carlo samples. (We used the 1MSL routine DCORVC for 
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Figure 5.8 Metropolis analysis of the flour beetle mortality data using a Gaus¬ 
sian proposal density with a nondiagonal E matrix. Monitoring plots use three 
parallel chains, and histograms use all samples following iteration 2000. Overall 
Metropolis acceptance rate: 27.3%. 


this purpose.) We then reran the algorithm with proposal variance matrix 

/ 0.000292 -0.003546 -0.007856 \ 

£ = 2£ = -0.003546 0.074733 0.117809 . (5.27) 

\ -0.007856 0.117809 0.241551 ) 

The results, shown in Figure 5.8, indicate improved convergence, with lower 
observed autocorrelations and a higher Metropolis acceptance rate (27.3%). 
This rate is close to the optimal rate derived by Gelman, Roberts, and Gilks 
(1996) of 31.6%, though this rate applies to target densities having three 
independent components. The marginal posterior results in both figures are 
consistent with the results obtained for this prior-likelihood combination 
by Carlin and Gelfand (1991b) and Miller (1991). ■ 
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5.4.4 Hybrid forms and other algorithms 

One of the nice features of MCMC algorithms is that they may be combined 
in a single problem, in order to take advantage of the strengths of each. To 
illustrate, suppose that Pi,..., P rn are Markov kernels all having stationary 
distribution p. These kernels might correspond to a Gibbs sampler and (m- 
1) Metropolis-Hastings algorithms, each with a different proposal density. 
A mixture of the algorithms arises when at iteration t, kernel Pi is chosen 
with probability a,-, where ^T =| ap 1. Alternatively, we might simply 
use the kernels in a cycle, wherein each kernel P i is used in a prespecified 
order. Clearly, a mixture or cycle of convergent algorithms will also produce 
a convergent algorithm. 

Example 5.8 Let Pi correspond to a standard Gibbs sampler, and P-j 
correspond to an independence chain Hastings algorithm (i.e., one where 

q(v, u) =g{v), independent of u). Suppose g is overdispersed with respect 
to the true posterior. For example, g might have roughly the same mode 
as p, but twice as large a standard deviation. Then a mixture of Pi and P% 
with (cti, ai) = (.9, .1) will serve to occasionally "jump start" the sampler, 
possibly with a large leap across the parameter space. This will in turn 
reduce autocorrelations while still preserving the convergence to the sta¬ 
tionary distribution p. Perhaps more simply, If could be used once every 
10 or 100 iterations in a deterministic cycle. ■ 

Another profitable method of combining transition kernels is motivated 
by the nonconjugacy problem mentioned at the end of Subsection 5.4.2. 
Suppose that 8 = (Oi,... ,9k), and that all of the full conditional distribu¬ 
tions are available in closed form except for p t ( 6, \6jp, y), perhaps due to 
nonconjugacy in the prior for 0, . Since the Metropolis-Hastings algorithm 
does not require normalized densities from which to sample, we might nat¬ 
urally think of running a Gibbs sampler with a Metropolis subalgorithm to 
obtain the necessary 0\' -* sample at iteration t. That is, we would choose a 
symmetric conditional candidate density 



and run a Metropolis subalgorithm for T iterations, accepting or rejecting 
the candidates as appropriate. We would then set 0W equal to the end re¬ 
suit of this subalgorithm, and proceed on with the outer Gibbs loop. Note 
that this same approach could be applied to any parameter for which we 
lack a closed form full conditional. This approach is often referred to as 
Metropolis within Gibbs, though some authors (notably Besag et al., 1995, 
and Chib and Greenberg, 1995 a) dislike this term, noting that Metropo¬ 
lis et al. (1953) and Hastings (1970) envisioned such algorithms in their 
original papers - long before the appearance of the Gibbs sampler. As a 
result, the terms univariate Metropolis or simply Metropolis steps might be 
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used, to distinguish the approach from Gibbs steps (where the candidate 
is always accepted) and the multivariate Metropolis algorithm introduced 
in Subsection 5.4.3. 

In using such an algorithm, two convergence issues arise. The first is a 
theoretical one: does this hybrid algorithm actually converge to the correct 
stationary distribution? The algorithm does not quite meet our definition 
of a cycle, since it is a deterministic, combination of Gibbs and Metropo¬ 
lis steps that would not themselves converge if applied individually (since 
the former does not update 0;, while the latter does not update the oth¬ 
ers). But each of the components does preserve the stationary distribution 
of the chain, so, provided the hybrid chain is aperiodic and irreducible, 
convergence still obtains. 

The second issue is more practical: how should we select T, the num¬ 
ber of Metropolis subiterations at each stage of the outer algorithm? We 
have so far been vague about how to terminate the Gibbs sampler, say¬ 
ing only that it is a hard problem; does not this hybrid algorithm then 
embed a hard problem within a hard problem? Fortunately, the answer 
is no: diagnosing convergence is no harder (or easier) than before. This is 
because convergence will occur for any value of T, and so we are free to 
pick whatever value we like. This is not to say that in a particular prob¬ 
lem, any T value will be as good as any other with respect to convergence. 
T = 100, 000 would no doubt be a poor choice, since all we would gain 
from this extensive sampling is high confidence that 0\ X is indeed a sample 
from the correct full conditional distribution. This would be of little value 
early on (i.e., when t is small), since the outer Gibbs algorithm would still 
be far from locating the true stationary distribution, and so the expensive 
sample would not likely be retained for subsequent inference. On the other 
hand, T = 1 might correspond to such a crude approximation to the true 
full conditional that convergence of the overall algorithm is retarded. In a 
trivariate normal setting investigated by Gelfand and Lee (1993), T = 4 
or 8 provided good performance, but of course this does not constitute a 
general recommendation. In practice, T = 1 is often adopted, effectively 
substituting one Metropolis rejection step for each Gibbs step lacking an 
easily sampled complete conditional. 

Muller (1991) suggested a specific implementation of this idea intended 
to speed convergence. Besides employing univariate Metropolis substeps for 
generation from unstandardized complete conditionals, it also employs a 
simple transformation that attempts to orthogonalize the parameter space. 
Specifically, at iteration t let £ = K K' be a current estimate and Cholesky 
decomposition of the covariance matrix of 0. Let tjB) = K~ 1 6^ and run 
a univariate Metropolis algorithm on the components of rf instead of 6. if 

0|y) resembles a normal distribution, then the components of r] should 
be approximately uncorrelated, thus speeding convergence of the sampler. 
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Miner recommends running this algorithm in parallel, in order to obtain 
independent samples from which to obtain the covariance estimate E. For 
those rjiS requiring a Metropolis update, the author suggests using a nor¬ 
mal proposal density with variance 1, a logical choice if the orthogonalizing 
transformation has indeed been successful (though recall Gelman, Roberts, 
and Gilks (1996) would use variance (2.4) 2 instead). Finally, he suggests up¬ 
dating the E estimate at regular intervals using the current iterates, again 
in order to improve the orthogonalization and further speed convergence. 

While Muller’s approach performs quite well in many settings, the or- 
thogonalizing transformation may be impractical, especially in high dimen¬ 
sions where an enormous number of samples will be necessary in order to 
obtain a good estimate. Moreover, the continual updating of S amounts to 
yet another form of hybrid MCMC algorithm: one that adaptively switches 
the transition kernel based on the output produced by the algorithm so 
far. While this seems perfectly sensible in this context, adaptive switching 
does not in general produce a convergent algorithm. This is shown by the 
following example, which was suggested by G.O. Roberts and developed 
more fully by Gelfand and Satin (1994). 


Example 5.9 Considerthe following simple joint posterior distribution: 



N 0 


1 

P 


P 

1 


Suppose we use the following adaptive algorithm. At iteration t, if < 0 
we transform to the new parameters U = X+ Y and V =X-Y, run one 
Gibbs step under this new ( U , V) parametrization, and finally transform 
back to ( X , Y). Elementary multivariate normal calculations show that 
the joint distribution of ( U , V ) is also normal with mean vector zero and 
covariance matrix 

(l 1 W 1 P \( 1 1 \ 2(1 +p) 0 

v i -i ) \p i a 1 -i; v o 2(1-p) 

That is, U and V can be independently generated as N{ 0, 2(1 + p)) and 
N(0, 2(1 - p )) variates, respectively. 

If X& > 0, we do not transform to (U, V )-spacc, instead drawing from 
the Gibbs full conditionals in the original parameter space. Clearly both of 
these transition kernels have the proper stationary distribution, but con¬ 
sider how the output from the adaptive algorithm will look. When X^ < 0, 
2G t+1 ' will be positive with probability 1/2 and negative with probability 
1/2, since its full conditional is symmetric about 0 and is independent of 
X^\ But when X^) > 0, X^ t+1) will be positive with probability greater 
than 1/2, since 



Y (i+i) I a - ( G _ N ^ x (t) t i _ p Ay 
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Taken together, these two results imply that long sequences from this al¬ 
gorithm will tend to have more positive Zs than negative. This is borne 
out in the simulation study conducted by Gelfand and Sahu (1994). With 
p = .9 and A' t0) ~ JV(0,1), they obtained P(A tl00) > 0) = .784 and 
E{X < 100) ) = .60 47. ■ 

In summary, adaptive (output-dependent) MCMC algorithms should be 
used with extreme caution. Perhaps the safest approach is to adapt only 
in the initial, exploratory phase of the algorithm (pilot adaptation), and 
then stick with a single, nonadaptive approach when saving samples to be 
used for subsequent inference. Alternatively, recent work by Mira and Sar¬ 
gent (2000) shows that adapting every m iterations for m fixed in advance 
will in most cases still maintain the chain's stationary distribution. Such 
(systematic adaptation) might be quite useful - say, in improving the cur¬ 
rent covariance estimate E in Miiller's approach above, or in the structured 
MCMC approach described in the next subsection. 

While the Gibbs sampler, the Metropolis-Hastings algorithm, and their 
hybrids are by far the two most common MCMC methods currently in 
use, there are a variety of extensions and other MCMC algorithms that 
are extremely useful in analyzing certain specific model settings or achiev¬ 
ing particular methodological goals. For example, Bayesian model choice 
typically requires specialized algorithms, since the models compared are 
generally not of identical size (see Sections 6.3 and 6.4 for a discussion of 
these algorithms). The remainder of this subsection describes a few useful 
specialized algorithms. 

Overrelaxation 

As already mentioned, high correlations within the sample space can sig¬ 
nificantly retard the convergence of both Gibbs and univariate Metropolis- 
Hastings algorithms. Tranformations to reduce correlations, as in Muller’s 
(1991) approach, offer a direct solution but will likely be infeasible in high 
dimensions. An alternative is to use overrelaxaiion, which attempts to speed 
the sampled chain's progress through the parameter space by ensuring that 
each new value in the chain is negatively correlated with its predecessor. 
This method is the MCMC analogue of the overrelaxation methods long 
used to accelerate iterative componentwise maximization algorithms, such 
as the Gauss-Seidel method (see e.g. Thisted, 1988, Sec. 3.11.3). 

In the case where all full conditional distributions p(6 i \9j^i, y) are Gaus¬ 
sian, Adler (1981) provides an effective overrelaxation method. Suppose 9, 
has conditional mean p, and conditional variance of. Then at iteration t, 

Adler's method draws 0^ as 

£ 

G { P = pi + tv(^* _1) - Pi) + cr<( 1 - a 2 ) 1/2 Z , (5.28) 

where — 1 < a < 1, and Z is a standard normal random variate. Choosing 
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a = 0 produces ordinary Gibbs sampling; choosing a < 0 produces overre¬ 
laxation, i.e., a ()\ q 1 sample on the other side of the conditional mean from 
0\ t ~ 1) (a > 0 corresponds to underrelaxation). When a = -1, the chain 
will not be convergent (it will get "stuck" tracing a single contour of the 
density), so values close to -1 (say, a = -0.98) are often recommended. 

Several authors generalize this approach beyond Gaussian distributions 
using transformations or Metropolis accept-reject steps, but these methods 
are often insufficiently general or suffer from unacceptably high rejection 
rates. Neal (1998) instead offers a generalization based on order statistics. 
Here we start by generating K independent values { k } t - 1 from the full 
conditional p(9i\0j 7 a, y). Arranging these values and the old value in non¬ 
decreasing order (and breaking any ties randomly), we have 

0 it o < 9i ,i < < 0i, r = < • • < 0i, K , (5-29) 

so that ris the index of the old value in this list. Then we simply take 
9i K-r as our new, overrelaxed value for d\ t] . The parameter K plays the 
role of a in Adler's method: K = 1 produces Gibbs sampling, while A —► oo 
is analogous to choosing a = -1 in (5.28). 

Neal (1998) shows how to implement this approach without explicit gen¬ 
eration of the K random variables in (5.29) by transforming the problem 
to one of overrelaxation for a Unit (0,1) distribution. This in turn requires 
the ability to compute the cumulative distribution function and its inverse 
for each full conditional to be overrelaxed. Still, the method may be worth¬ 
while even in settings where this cannot be done, since the increase in 
mntime created by the K generations is typically offset by the resulting 
increase in sampling efficiency. Indeed, the WinBUGS package incorporates 
overrelaxation via direct sampling of the values in (5.29). 

Blocking and Structured MCMC 

A general approach to accelerating MCMC convergence is blocking, i.e., 
updating multivariate blocks of (typically highly correlated) parameters. 
Recent work by Liu (1994) and Liu et al. (1994) confirms its good perfor¬ 
mance for a broad class of models, though Liu et al. (1994, Sec. 5) and 
Roberts and Satin (1997, Sec. 2.4) give examples where blocking actually 
slows a sampler's convergence. Unfortunately, success stories in blocking 
are often application-specific, and general rules have been hard to find. 

Sargent, Hodges, and Carlin (2000) introduce a simple, general, and flex¬ 
ible blocked MCMC method for a large class of hierarchical models called 
Structured MCMC (SMCMC). The approach accelerates convergence for 
many of these models by blocking parameters and taking full advantage 
of the posterior correlation structure induced by the model and data. As 
an elementary illustration, consider the balanced one-way random-effects 
model with JK observations y, j, i = 1, . . . , K,j = 1, . . . , J. The first stage 
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of the model assumes that y t] = 0, Y(ij where f t] ~ N(0, cr~). At the second 
stage Oi — n + Si where 6 t ~ N(0, v), while at the third stage \i — M + £ 
where £ ~ N( 0, r 2 ). Adding prior distributions for <j 2 and u completes the 
model specification. Rewriting the second and third stages as 


0 — —Oi + n + Si , 

and M = /t — £ , 

the model can be expressed in the form of a linear model as 
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(5.30) 


S' = (Si,..., 5 k) 1 ■ Hodges (1998) shows that Bayesian inferences based on 
(5.30) are identical to those drawn from the standard formulation. 

A wide variety of hierarchical models can be reexpressed in a general 
form of which (5.30) is a special case, namely. 
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+ 


(5.31) 


In more compact notation, (5.31) can be expressed as 


Y = X9 + E, (5.32) 

where X and Y are known, 0 is unknown, and E is an error term having 
Cov( E) = T with F block diagonal having blocks corresponding to the 
covariance matrices of each of e, 6, and £, i.e., 

Cov(E) = F = Diag (Cov(e), Cov(6} } Cov(£)) . (5.33) 

In our simple example (5.30), T is fully (not just block) diagonal. 

Hodges (1998) uses the term "data case" to refer to rows of X, Y, and 
E in (5.32) corresponding to X\ in (5.31). These are the terms in the joint 
posterior into which the outcomes y enter directly. Rows of X, Y , and 
E corresponding to the Hi are "constraint cases." They place restrictions 
(stochastic constraints) on possible values of 0 j. Finally, the rows of X, Y, 
and E corresponding to the Gi are called "prior cases," this label being 
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reserved for cases with known (specified) error variances. We refer to this 
format as the constraint case formulation of a hierarchical model. 

For models with normal errors for the data, constraint, and prior cases, 
Gibbs sampling and (5.32) allow multivariate draws from the marginal 
posterior of 8. From (5.32), 8 has conditional posterior 

(XT- 1 *)- 1 ) . (5.34) 

With conjugate priors for the variance components (i.e., gamma or Wishart 
priors for their inverses), the full conditional for the variance components 
is a product of gamma and/or Wishart distributions. Convergence for such 
samplers is often virtually immediate; see Flodges (1998, Sec. 5.2), and the 
associated discussion by Wakefield (1998). The constraint case formulation 
works by using, in each MCMC draw, all the information in the model and 
data about the posterior correlations among the elements of Q. 

The algorithm just outlined has three key features. First, it samples 8 
(i.e., all of the mean-structure parameters) as a single block. Second, it 
does so using the conditional posterior covariance of 8, supplied by both 
the mean structure of the model, captured by X, and the covariance struc¬ 
ture of the model, captured by T. Finally, it does so using the conditional 
distribution in (5.34), for suitable definitions of X andr. A SMCMC algo¬ 
rithm for a hierarchical model is any algorithm having these three features. 

For linear models with normal errors, the suitable values of X and r are 
those in (5.32) and (5.33). These permit several different SMCMC imple¬ 
mentations, the simplest of which is the blocked Gibbs sampler discussed 
above. However, although this Gibbs implementation may converge in few 
iterations, computing may be slow due to the need to invert a large matrix, 
XT” 1 X, at each iteration. An alternative would be to use (5.34) as a can¬ 
didate distribution in a Hastings independence chain algorithm. Here we 
might occasionally update T during the algorithm's pre-convergence "burn 
in" period. This pilot adaptation scheme (Gilks et al., 1998) is simple but 
forces us to use a single (possibly imperfect) r for the post-convergence 
samples that are summarized for posterior inference. An even better alter¬ 
native might thus be to continually update T every m iterations for some 
preselected m (say, m = 100). Mira and Sargent (2000) show that such 
a systematic adaptation scheme can deliver improvements in convergence 
without disturbing the stationary distribution of the Markov chain. 
Example 5.10 A recent clinical trial (Abrams et al., 1994) compared di- 
danosine (ddl) and zalcitabine (ddC) in HIV-infected patients. This dataset 
is described in detail in Section 8.1; for now, we take the response variable 
for patient i to be the change in CD4 count (for which higher levels 
indicate a healthier immune system) between baseline and the two-month 
follow-up, for the K = 367 patients who had both measurements. Three 
binary predictor variables and their interactions are of interest: treatment 
group (aii,: 1 = ddC, 0 = ddl), reason for eligibility (a? 2 t: 1 = failed ZDV, 
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0 = intolerant to ZDV), and baseline Karnofsky score (x 3 f. 1 = score > 85, 
0 = score < 85; higher scores indicate better health). 

Consider the saturated model for this 2 3 factorial design: 

Yi = 6 0 + xuO\ + x 2 i 8 2 + x 3 i 6 3 

+X 1 iX2i0i2 + X 1 iX 3 iO 13 + X2iX 3 iO23 + X 1 iX2iX 3i 0i23 + £i , ’ 

where q A r (0, <r 2 ), i = 1, 367. We use flat priors for the intercept 0 O 
and main effects (Q l , 6 o, 6 3 ), but place hierarchical constraints on the two- 
and three-way interaction terms, namely Q i ~ N(0,af) for 1 = 12, 13, 23, 
and 123. This linear model is easily written in the form (5.31). Adopting a 
vague 0(0.0001.0.0001) prior for r = l/a 2 (i.e., having mean 1 and variance 
10 4 ) and independent G(l,l) priors for the hi = 1/<t 2 , 1 = 12,13, 23,123, 
completes the specification. 

Equation (5.34) yields a SMCMC implementation that alternately sam¬ 
ples from the multivariate normal full conditional p(0|h, r, y), and the 
gamma full conditionals p(r|0,h,y) and p(hi\d, t, y). As previously men¬ 
tioned, this Gibbs sampler is a SMCMC implementation that updates the 
candidate T at every iteration. We also consider a pilot-adaptive SMCMC 
implementation, in which we update V at iterations 1, . . ., 10, then every 
10th iteration until iteration 1000, and then use the value of f at iteration 
1000 in the "production" run, discarding the 1000 iteration burn-in period. 
In this chain, p(#|h, r, y) is used as a candidate distribution for 6 in a 
Hastings independence subchain, where h and f are the components of P 
at iteration 1000. 

Our model is fully conjugate, so BUGS handles it easily. Besides a standard 
univariate Gibbs sampler, BUGS allows only blocking of the fixed effects 
( 80 , 61 , 82 , 63 ) into a single vector, for which we specified a multivariate 
normal prior having near-zero precision. For a comparison based on speed 
alone, we also ran a univariate Gibbs sampler coded in Fortran. 

While we caution against putting too much stock in crude runtimes 
(which are machine-, compiler-, and programmer-dependent), our runtimes 
(in seconds) to obtain 5000 post-burn-in samples on an Ultra Sparc 20 work¬ 
station were as follows: univariate BUGS, 37.4; partially blocked BUGS, 72.7' 
univariate Gibbs implemented in Fortran, 27.1; SMCMC using fully blocked 
Gibbs, 153.0; and SMCMC with pilot adaptation, 18.5. Pilot-adaptive SM¬ 
CMC dominates the other three in terms of generation rate in "effective 
samples" per second (see Subsection 5.4.5 for a precise definition of effec¬ 
tive sample size); for 0, the advantage is substantial. The BUGS chains are 
fast but produce highly autocorrelated samples, hurting their effective gen¬ 
eration rate. By contrast, SMCMC implemented as a fully blocked Gibbs 
sampler produces essentially uncorrelated draws, but at the cost of long 
runtimes because of repeated matrix inversions. ■ 

Next consider hierarchical nonlinear models, i.e., models in which the 
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outcome y is not linearly related to the parameters in the mean structure. 
The data cases of such models do not fit the form (5.31), but the constraint 
and prior cases can often be written as a linear model. Because of the 
nonlinearity in the data cases, it is now less straightforward to specify the 
"suitable definitions of X and T” needed for a SMCMC algorithm. To do 
so, we must supply a linear structure approximating the data's contribution 
to the posterior. We do this by constructing artificial outcome data y with 
the property that E(y |0i) is roughly equal to 0j. Rough equality suffices 
because we use y only in a Metropolis-Hastings implementation to generate 
candidate draws for 6 . Specifically, for a given y with covariance matrix V, 
we can write the approximate linear model 
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(5.36) 


where e ~ N(0 V). Equation (5.36) supplies the necessary X matrix for a 
SMCMC algorithm, and the necessary r uses cov(e = V and the appro¬ 
priate covariance matrices for 6 and £ in (5.33). 

The artificial outcome data y can be supplied by two general strategies. 
The first makes use of crude parameter estimates from the nonlinear part 
of the model (without the prior or constraint cases), provided they are 
available. This might occur when the ratio of data elements to parameters 
is large and the constraint and prior cases are included solely to induce 
shrinkage in estimation. We call such estimates "unshrunk estimates," and 
use them and an estimate of their variance as y and V, respectively. 

In certain problems, however, such unshrunk estimates may be unstable, 
or the model may not be identified without constraint and prior cases. Here 
one can instead run a simple univariate Metropolis algorithm for a small 
number of iterations, using prior distributions for the variance components 
that insist on little shrinkage. The posterior mean of# approximated by this 
algorithm is a type of artificial data y which, when used with the constraint 
and prior cases in (5.36), approximates the nonlinear fit well enough for 
the present purpose. These "low shrinkage" estimates and corresponding 
estimates of their variance can then be used as y anc j y ^ respectively; see 
Sargent et al. (2000) for implementational details. 


Auxiliary variables and slice sampling 

Yet another variant on the basic MCMC approach is to use auxiliaty vari¬ 
ables. In this approach, in order to obtain a sample from a particular sta¬ 
tionary distribution p(x), x £ X with x possibly vector-valued, we enlarge 
the original state space X using one or more auxiliary (or latent) vari- 
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ables. In the spirit of the original data augmentation algorithm, a clever 
enlargement of the state space can broaden the class of distributions we 
may sample, and also serve to accelerate convergence to the (marginal) 
stationary distribution. Use of this idea in MCMC sampling arose in sta¬ 
tistical physics (Swendsen and Wang, 1987; Edwards and Sokal, 1988), was 
brought into the statistical mainstream by Besag and Green (1993), and 
recently developed further by Neal (1997), Higdon (1998), and Damien et 
al. (1999). 

The basic idea is as follows. Let p{x) be the target distribution, likely 
known only up to a normalizing constant. A new variable u is introduced 
having conditional distribution p(u|x), so that the joint distribution of u 
and x is specified as p(it, x) = p(a;)p(u|a:). We then construct a Markov 
chain on X X U by alternately updating x and u via Gibbs sampling or some 
other method that preserves p(u, x), hence the desired marginal distribution 
p(x). The conditional of u given x is chosen in a "convenient" way, where 
this can mean either that the auxiliary variable is introduced to allow better 
mixing or faster convergence of the process defined on the enlarged state 
space, or that the resulting algorithm is easier to implement or runs faster 
in terms of CPU time. 

In the simple slice sampler, we take u \x ~ Unif(0,p(x)). This leads to 
a uniform joint distribution 

p{x,u) oc I{o<.u<p(x)} (*U (5.37) 

whose marginal density for x is clearly p(x). A Gibbs sampler for this distri¬ 
bution would require only two uniform updates: u\x ~ Unif(0 , p(x)), and 
x\u ~ Unif(x : p(x) > u). Convenient generation of the latter variate (over 
the "slice" defined by u) thus requires that p(x) be invertible, either ana¬ 
lytically or numerically. Once we have a sample of (x, u) values, we simply 
disregard the u samples, thus obtaining (after a sufficiently long burn-in 
period) a sample from the target distribution p(x) via marginalization. 

More generally, if p(x) oc ir(x)L(x) where L(x) is a positive function (as 
in the usual Bayesian context of an unknown posterior p proportional to 
a prior 7 r times a likelihood L), then a convenient choice for the condi¬ 
tional distribution of u is Unif( 0, L(x)). More generally we might factor 
p{x) as q(x)L '(x) where L'(x) is a positive function. This is what is gen¬ 
erally referred to as the slice sampler. Note that, like the Gibbs sampler, 
this algorithm does not require the user to specify any tuning constants 
or proposal densities, and yet, like the Metropolis-Hastings algorithm, it 
allows sampling in nonconjugate model settings. Indeed, Higdon (1998) 
shows that the Metropolis algorithm can be viewed as a special case of the 
slice sampler. 

Finally, suppose p(x) oc tt(x) []" =1 Li(x). We can use a collection of aux¬ 
iliary variables u = (u\,u n ), where we take p(u,ja;) to be Unif (0, Li(x)) 
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for all i. Now the joint density is given by 

n 

p(x, Ml,, u n ) oc ir(x) I{a<u t <L,(x)}(x, w*) • (5.38) 

i = l 

Clearly the marginal density for x is /Ax), as desired. The full conditionals 
for the m are Unif (0, while the full conditional for x is simply 

7r(^) restricted to the set { x : Li(x) > Ui , i = l,...,n}. Damien et al. 
(1999) show that this product slice sampler can be used to advantage in 
numerous hierarchical and nonconjugate Bayesian settings where the Li are 
readily invertible and the intersection set {a; : L,(x) > Uj, i = 1,..., n} is 
thus easy to compute. However, Neal (1997) points out that introducing so 
many auxiliary variables (one for each term in the likelihood) may lead to 
slow convergence. This author recommends hybrid Monte Carlo methods 
that suppress the random walk behavior, including the overrelaxed version 
(Neal, 1998) described above. Mira (1998) also shows clever approaches 
to efficiently sample the variable of interest given the proposed auxiliary 
variable(s); rejection sampling from w(x) may be used as a last resort. 

Mira and Tierney (1997) and Roberts and Rosenthal (1999a) give mild 
and easily verified regularity conditions that ensure that the slice sampler 
is geometrically and often even uniformly ergodic. Mira and Tierney (1997) 
show that given any Metropolis-Hastings independence chain algorithm us¬ 
ing some h(x) as proposal distribution, a "corresponding" slice sampler can 
be designed (by taking q(x) = h(x) in the p(x) =q(x)L'(x) factorization 
above) that produces estimates with uniformly smaller asymptotic vari¬ 
ance (on a sweep-by-sweep basis), and that converges faster to stationarity. 
Roberts and Rosenthal (1999a) also show that the slice sampler is stochas¬ 
tically monotone, which enables development of some useful, quantitative 
convergence bounds. All of this suggests that, when the slice sampler can be 
implemented, it has spectacularly good theoretical convergence properties. 

Future work with slice sampling looks to its “perfect sampling" extension 
(Mira, Mpller, and Roberts, 1999; see also our brief discussion of this area 
on page 176), as well as its application in challenging applied settings be¬ 
yond those concerned primarily with image restoration, as in Hurn (1997) 
and Higdon (1998). An exciting recent development in this regard is the 
polar slice sampler (Roberts and Rosenthal, 1999b), which has excellent 
convergence properties even for high dimensional target densities. 

Finally, we hasten to add that there are many closely related papers 
that explore the connection between expectation-maximization (EM) and 
data augmentation methods, borrowing recent developments in the former 
to obtain speedups in the latter. The general idea is the same as that for 
the auxiliary variable methods described above: namely, to expand the pa¬ 
rameter space in order to achieve algorithms which are both suitable for 
general usage (i.e., not requiring much fine-tuning) and faster than stan- 
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dard Metropolis or Gibbs approaches. For example, the "conditional aug¬ 
mentation" EM algorithm of Meng and Van Dyk (1997) and the "marginal 
augmentation" parameter expanded EM (PX-EM) algorithm of Liu et al. 
(1998) are EM approaches that served to motivate the later, specifically 
MCMC-oriented augmentation schemes of Meng and Van Dyk (1999) and 
Liu and Wu (1999), the latter of which the authors refer to as parameter 
expanded data augmentation (PX-DA). 

Summary 

Many of the hybrid algorithms presented in this subsection attempt to 
remedy slow convergence due to high correlations within the parameter 
space. Examples include MCMC algorithms which employ blocking (i.e., 
updating parameters in medium-dimensional groups, as in SMCMC) and 
collapsing (i.e., generating from partially marginalized distributions), as 
well as overrelaxation and auxiliary variable methods like the slice sampler. 
But of course, the problem of estimation in the face of high correlations is 
not anew one: it has long been the bane of maximization algorithms applied 
to likelihood surfaces. Such algorithms often use a transformation designed 
to make the likelihood function more regular (e.g.. in 2-space, a switch from 
a parametrization wherein the effective support of the likelihood function is 
oblong to one where the support is more nearly- circular). Muller's ( 1991) 
“orthogonalizing” transformation applied within his adaptive univariate 
Metropolis algorithm is one such approach; see also Hills and Smith (1992) 
in this regard. Within the class of hierarchical linear mixed models. Gelfand 
et al. (1995) discuss simple hierarchical centering reparametrizations that 
often enable improved algorithm convergence. 

Accelerating the convergence of an MCMC algorithm to its stationary 
distribution is currently an extremely active area of research. Besides those 
presented here, other promising ideas include resampling and adaptive 
switching of the transition kernel (Gelfand and Sahu. 1994) and multi¬ 
chain annealing or tempering (Geyer and Thompson. 1995: Neal, 1996a). 
Gilks and Roberts (1996) give an overview of these and other acceleration 
methods. In the remainder of this chapter, we continue with variance es¬ 
timation and convergence investigation for the basic algorithms, returning 
to these more specialized ones in subsequent chapters as the need arises. 

5.4.5 Variance estimation 

We now turn to the problem of obtaining estimated variances (equivalently, 
standard errors) for posterior means obtained from MCMC output. Our 
summary here is rather brief; the reader is referred to Ripley (1987, Chapter 
6) for further details and alternative approaches. 

Suppose that for a given parameter \ we have a single long chain of 
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MCMC samples {A^}^, which for now we assume come from the station¬ 
ary distribution of the Markov chain (we attack the convergence problem 
in the next subsection). The simplest estimate of ^(Aly) is then given by 

E{\\y) = \ N = ±J2 X(t) ’ 

v t=i 

analogous to the estimator given in (5.6) for the case of iid sampling. Con¬ 
tinuing this analogy, then, we could attempt to estimate Var(Xi^) by fol¬ 
lowing the approach of (5.7). That is, we would simply use the sample 
variance, ~ ^ N ) 2 , divided by N, obtaining 

Var iid (\ N ) = s\/N = ^ - A w ) 2 . 

While this estimate is easy to compute, it would very likely be an underesti¬ 
mate due to positive autocorrelation in the MCMC samples. This problem 
could be ameliorated somewhat by combining the draws from a collection 
of initially overdispersed sampling chains. Alternatively, one could simply 
use the sample variance of the N th iteration output from m independent 
parallel chains, as in Figure 5.4. But as already mentioned, this approach 
is horribly wasteful, discarding an appalling ( N - 1 )/N of the samples. A 
potentially cheaper but similar alternative would be to subsample a single 
chain, retaining only every k th sample with k large enough that the re¬ 
tained samples are approximately independent. However, MacEachern and 
Berliner (1994) give a simple proof using the Cauchy-Schwarz inequality 
that such systematic subsampling from a stationary Markov chain always 
increases the variance of sample mean estimators (though more recent work 
by MacEachern and Peruggia, 2000a, shows that the variance can decrease 
if the subsampling strategy is tied to the actual updates made). Thus, in 
the spirit of Fisher, it is better to retain all the samples and use a more 
sophisticated variance estimate, rather than systematically discard a large 
portion of them merely to achieve approximate independence. 

One such alternative uses the notion of effective sample size, or ESS 
(Kass et al. 1998, p. 99). ESS is defined as 

ESS = N/k( A) , 

where «(A) is the autocorrelation time for A, given by 

OO 

K ( A) = 1 + 2 pk( A) , 

k-l 

where pk{\) is the autocorrelation at lag k for the parameter of interest A. 
We may estimate /c(A) using sample autocorrelations estimated from the 
MCMC chain, cutting off the summation when these drop below, say, 0.1 
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in magnitude. The variance estimate for Ay is then 

Var ES s(^N) = s 2 x /ESS( A) = — ^ ^ — ^(A (i) - \ N ) 2 


Note that unless the A^ are uncorrelated, y(A) > 1 and ESS( A) < TV, so 
that VarEss(^N) > VaP»»d(A./v), in concert with intuition. That is, since 
we have fewer than N effective samples, we expect some inflation in the 
variance of our estimate. 

A final and somewhat simpler (though also more naive) method of esti¬ 
mating Ear (A at) is through hatching. Divide our single long run of length 
N into m successive batches of length k (i.e., N = mk). with batch means 
B\,..., B m . Clearly X N = B = We then have the variance 

estimate 


E arj a ^ c ^(Ajv) — 


1 


E< S ' 


(5.39) 


m(m — 1) 

v ' i=i 

provided that k is large enough so that the correlation between batches 
is negligible, and m is large enough to reliably estimate Var(Bi). It is 
important to verify that the batch means are indeed roughly independent 
- say, by checking whether the lag 1 autocorrelation of the B, is less than 
0.1. If this is not the case, we must increase k (hence TV, unless the current 
m is already quite large), and repeat the procedure. 

Regardless of which of the above estimates E is used to approximate 
Ear(Ajv), a 95% confidence interval for ^7( A |y) is then given by 


A N ± 2.025 


vT, 


where z .025 = 1-96, the upper .025 point of a standard normal distribution. 
If the hatching method is used with fewer than 30 batches, it is a good idea 
to replace 2 025 by < m - 1 .. 025 , the upper .025 point of a t distribution with 
m - 1 degrees of freedom. 


5.4.6 Convergence monitoring and diagnosis 

We conclude our presentation on MCMC algorithms with a discussion of 
their convergence. Because their output is random and autocorrelated, even 
the definition of this concept is often misunderstood. When we say that an 
MCMC algorithm has converged at time T , we mean that its output can 
be safely thought of as coming from the true stationary distribution of the 
Markov chain for all t > T. This definition is not terribly precise since we 
have not defined "safely," but it is sufficient for practical implementation: 
no pre-convergence samples should be retained for subsequent inference. 
Some authors refer to this pre-convergence time as the burn-in period. 

To provide an idea as to why a properly specified MCMC algorithm may 
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be expected to converge, we consider the following simple bivariate two- 
compartment model, originally analyzed by Casella and George (1992). 

Example 5.11 Suppose the joint distribution of the two variables X and 
Y is summarized by the 2x2 layout in Table 5.5. In this table, all p,- > 0 





0 

1 

0 

Pi 

P2 

1 

P3 

P4 


Table 5.5 2x2 multinomial distribution of X and Y. 


and Yli=i Pi - 1. Notating the joint distribution as f Ty , we note it can be 
characterized completely by the 2x2 matrix 


( fx,y( 0)0) f x> ;/(l,0) \ _ f Pi P2 \ 
\ /®,y(0, 1) fx,y{ 1)1) ) \ P3 P4 J 


The true marginal distribution of X is then 

f _ ( /*(0) \ - f P1+P3 \ 

U ~\fAl)J~{p2+P4j 


(5.40) 


and similarly for Y. The two complete conditional distributions are also 
easily written in matrix form, namely, 


f/vl*(0|0) /y|x(l|0) 

V fy\A 0|1) /y|,(l|l) 


A , _ ( /*|»(0|0) /«|y(l|0) 

Tly V / x\y (011) /*| 9 (1|1) 


Pi 

P 3 

P1+P3 

P 2 

P1+P3 
V 4 

P2+P4 

P2+P4 

Pi 

P 2 

P1+P2 

P 3 

P1+P2 

P 4 

P3+P4 

P3+P4 


(5.41) 


(5.42) 


A y | x and A x \ y can be thought of as transition matrices, giving the probabil¬ 
ities of getting from the two X -states to the two Y -states, and vice versa. 
In fact, if all we care about is the X -marginal f x (x), we could consider the 
X sequence alone, since it is also a Markov chain with transition matrix 


- Ay\ X A X \y , 

which gives P = x\X^ = x). Repeating this process t times, we 

have that (Ari*) 4 gives P (X^ = ar|X(°) = x). Writing /ir as the distri¬ 
bution of X( f \ we have 

fP = fi 0 ) (A xlx y 

= ■ Ax]x 
= fx~ 1] A x \ x . 
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Provided that every entry in A x \ x is positive, a fundamental theorem from 
stochastic processes assures that for any , f 'J 1 converges to a vector f x 
which satisfies 

fx — fx-^x\x • (5.43) 

(A check of this fact in our scenario is left as Exercise 19.) Hence using 
any starting distribution ff 11 . f x t> does converge to/ x as ( —> oo,with f-f ' 
being a better and better approximation as t grows larger. ■ 

Using a rigorous mathematical framework, many authors have attempted 
to establish conditions for convergence of various MCMC algorithms in 
broad classes of problems. For example, Roberts and Smith (1993) provide 
relatively simple conditions for the convergence of the Gibbs sampler and 
the Metropolis-Hastings algorithm. Roberts and Tweedie (1996) show the 
geometric ergodicity of a broad class of Metropolis-Hastings algorithms, 
which in turn provides a central limit theorem (i.e., asymptotic normality of 
suitably standardized ergodic sums of the output from such an algorithm); 
see also Meyn and Tweedie (1993, Chapter 17) in this regard. Results 
such as these require elements of advanced probability theory that are well 
beyond the scope of this book, and hence we do not discuss them further. 
We do however discuss some of the common causes of convergence failure, 
and provide a brief review of several diagnostic tools used to make stopping 
decisions for MCMC algorithms. 

Overparametrization and identifiability 

Ironically, the most common source of MCMC convergence difficulties is a 
result of the methodology's own power. The MCMC approach is so gen¬ 
erally applicable and easy to use that the class of candidate models for a 
given dataset now appears limited only by the user's imagination. How¬ 
ever, with this generality has come the temptation to fit models so large 
that their parameters are unidentified , or nearly so. To see why this trans¬ 
lates into convergence failure, consider the problem of finding the posterior 
distribution of Q± where the likelihood is defined by 

Yi\O u 02 ~ N(e 1 + e 2 , 1) , i = 1.n , (5.44) 

and we adopt independent flat priors for both 9i and 0 2 . Only the sum of 
the two parameters is identified by the data, so without proper priors for 
91 and 0 2 , their marginal posterior distributions will be improper as well. 
Unfortunately, a naive application of the Gibbs sampler in this setting 
would not reveal this problem. The complete conditionals for 0\ and 0 2 
are both readily available for sampling as normal distributions. And for 
any starting point, the sampler would remain reasonably stable, due to the 
ridge in the likelihood surface L{9i, 9 2 ) at 6\ + 0 2 = y. The inexperienced 
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user might be tempted to use a smoothed histogram of the o\ g ^ samples 
obtained as a (proper) estimate of the (improper) posterior density p{9\ | y)! 

An experienced analyst might well argue that Gibbs samplers like the one 
described in the preceding paragraph are perfectly legitimate, provided that 
their samples are used only to summarize the posterior distributions of iden¬ 
tifiable functions of the parameters (in this case, g(6) = 9\ +0 2 )- But while 
the deficiency in model (5.44) is immediately apparent, in more complicated 
settings (e.g., hierarchical and random effects models) failures in identifi- 
ability can be very subtle. Moreover, models that are overparametrized 
(either deliberately or accidentally) typically lead to high posterior corre¬ 
lations among the parameters (cross correlations) which will dramatically 
retard the movement of the Gibbs sampler through the parameter space. 
Even in models which are identified, but "just barely" so (e.g., model (5.44) 
with a vague but proper prior for ), such high crosscorrelations (and the 
associated high autocorrelations in the realized sample chains) can lead to 
excruciatingly slow convergence. MCMC algorithms defined on such spaces 
are thus appropriate only when the model permits a firm understanding of 
which parametric functions are well-identified and which are not. 

For the underidentified Gaussian linear model y = Xft T e with X 
less than full column rank, Gelfand and Sahu (1999) provide a surpris¬ 
ing MCMC convergence result. They show that under a flat prior on /3, 
the Gibbs sampler for the full parameter vector f3 is divergent, but the 
samples from the identified subset of parameters (say, 6 = Ai/3) form an 
exact sample from their (unique) posterior density p(<5|y). That is, such a 
sampler will produce identically distributed draws from the true posterior 
for S, and convergence is immediate. In subsequent work, Gelfand, Car¬ 
lin, and Trevisam (2000) consider a broad class of Gaussian models with 
covariates, namely. 


Y = (A” 0 XoAj ••• A' 0 A r )/3 + e 

with f3 = (Po>Pi, 11 then turns out Corr{6^\ 6 t<+1) ) approaches 
0 as the prior variance component for /3 0 goes to infinity once the chain 
has converged. This then permits independent sampling from the posterior 
distributions of estimable parameters. Exact sampling for these parame¬ 
ters is also possible in this case provided that all of the f3 prior variance 
components go to infinity. Simulation work by these authors suggests these 
results still hold under priors (rather than fixed values) for the variance 
components that are imprecise but with large means. For more on over- 
parametrization and posterior impropriety, see Natarajan and McCulloch 
(1995) Hobert and Casella(1996), and Natarajan and Kass (2000). Carlin 
and Louis (2000) point out that these dangers might motivate a return to 
EB analysis in such cases; c.f. Ten Have and Localio (1999) in this regard. 
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Proving versus diagnosing convergence 

We have already mentioned how a transition matrix (or kernel, for a con¬ 
tinuous parameter space) can be analyzed to conclude whether or not the 
corresponding MCMC algorithm will converge. But in order to apply the 
method in actual practice, we need to know not only if the algorithm will 
converge, but when. This is a much more difficult problem. Theoretical 
results often provide convergence rates which can assist in selecting be¬ 
tween competing algorithms (e.g., linear versus quadratic convergence), 
but since the rates are typically available only up to an arbitrary con¬ 
stant, they are of little use in deciding when to stop a given algorithm. 
Recently, however, some authors have made progress in obtaining bounds 
on the number of iterations T needed to guarantee that distribution being 
sampled at that time, p^' r >(9), is in some sense within t of the true sta¬ 
tionary distribution, p(9). For example, Poison (1996) develops polynomial 
time convergence bounds for a discrete jump Metropolis algorithm operat¬ 
ing on a log-concave target distribution in a discretized parameter space. 
Rosenthal (1993, 1995a,b) instead uses Markov minorization conditions, 
providing bounds in continuous settings involving finite sample spaces and 
certain hierarchical models. Approaches like these hold great promise, but 
typically involve sophisticated mathematics in sometimes laborious deriva¬ 
tions. Cowles and Rosenthal (1998) ameliorate this problem somewhat by 
showing how auxiliary simulations may often be used to verify the neces¬ 
sary conditions numerically, and at the same time provide specific values 
for use in the bound calculations. A second difficulty is that the bounds 
obtained in many of the examples analyzed to date in this way are fairly 
loose, suggesting numbers of iterations that are several orders of magni¬ 
tude beyond what would be reasonable or even feasible in practice (though 
see Rosenthal, 1996, for some tight bounds in model (3.9), the two-stage 
normal-normal compound sampling model). 

A closely related area that is the subject of intense recent research is ex¬ 
act or perfect sampling. This refers to MCMC simulation methods that can 
guarantee that a sample drawn at a given time will be exactly distributed 
according to the chain's stationary distribution. The most popular of these 
is the "coupling from the past" algorithm, initially outlined for discrete 
state spaces by Propp and Wilson (1996). Here, a collection of chains from 
different initial states are run at a sequence of starting times going back¬ 
ward into the past. When we go far enough back that all the chains have 
"coupled" by time 0, this sample is guaranteed to be an exact draw from 
the target distribution. Green and Murdoch (1999) extend this approach 
to continuous state spaces. The idea has obvious appeal, since it seems to 
eliminate the convergence problem altogether! But to date such algorithms 
have been made practical only for relatively small problems within fairly 
well-defined model classes; their extension to high-dimensional models (in 
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which the conditioning offered by the Gibbs sampler would be critical) re¬ 
mains an open problem. Moreover, coupling from the past delivers only a 
single exact sample; to obtain the collection of samples needed for infer¬ 
ence, the coupling algorithm would need to be repeated a large number of 
times. This would produce uncorrelated draws, but at considerably more 
expense than the usual "forward" MCMC approach, where if we can be 
sure we are past the burn-in period, every subsequent (correlated) sample 
can safely be thought of as a draw from the true posterior. 

As a result of the practical difficulties associated with theoretical conver¬ 
gence bounds and exact sampling strategies, almost all of the applied work 
involving MCMC methods has relied on the application of diagnostic tools 
to output produced by the algorithm. Early attempts by statisticians in this 
regard involved comparing the empirical distributions of output produced 
at consecutive (or nearly consecutive) iterations, and concluding conver¬ 
gence when the difference between the two was negligible in some sense. 
This led to samplers employing a large number of parallel, independent 
chains, in order to obtain simple moment, quantile, and density estimates. 

Example 5.12 Recall the rat population growth model from Example 5.6. 
Gelfand et al. (1990) used m = 50 parallel replications, and judged sampler 
convergence by monitoring density estimates of the form (5.16) every 5 it¬ 
erations (far enough apart to be considered independent). They concluded 
convergence at iteration T = 35 based on a "thick felt-tip pen test": the 
density estimates plotted at this iteration differed by less than the width of 
a thick felt-tip pen. Setting aside concerns about the substantial wastage 
of samples in this highly parallel approach (see Figure 5.4), not to mention 
the establishment of an international standard for MCMC convergence as¬ 
sessment pens, this diagnostic is unsuitable for another reason: it is easily 
fooled into prematurely concluding convergence in slowly mixing samplers. 
This is because it measures the distance separating the sampled distribu¬ 
tion at two different iterations, rather than that separating either from the 
true stationary distribution. ■ 


Of course, since the stationary distribution will always be unknown to 
us in practice, this same basic difficulty will plague any convergence diag¬ 
nostic. Indeed, this is what leads many theoreticians to conclude that all 
such diagnostics are fundamentally unsound. In the context of the previ¬ 
ous example, we would like to ensure that the distance between the true 
distribution and our estimate of it is small, i.e., f \ft(x) — f(x)\dx < e. But 
unfortunately, by the triangle inequality we have 


/ 


f t (x) - f{x) 


dx 


<-! 


M x ) - M x ) 


dx + J \ft(x) - f(x)\dx . 


(5.45) 


The first term in the sum, sometimes referred to as the "mean squared 
error component" or "Monte Carlo noise," becomes small as m —> oo, by 
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the Law of Large Numbers. The second term, sometimes referred to as the 
"bias component" or "convergence noise," becomes small as t —* oo, by the 
convergence of the Markov chain to its stationary distribution. Convergence 
diagnosis is thus concerned with choosing T so that the second term is 
sufficiently small; m can then be increased to whatever level is required to 
make the first term small as well. Note that for a fixed generation budget, 
there could be a tradeoff in the choice of t and m. But in any event, all 
we can hope to see is f |/ t (z) — fi + k(x)\dx, which is not directly related to 
any of the three quantities in (5.45). 

Despite the fact that no diagnostic can "prove" convergence of a MCMC 
algorithm (since it uses only a finite realization from the chain), many 
statisticians rely heavily on such diagnostics, if for no other reason than "a 
weak diagnostic is better than no diagnostic at all." We agree that they do 
have value, though it lies primarily in giving a rough idea as to the nature 
of the mixing in the chain, and perhaps in keeping the analyst from mak¬ 
ing a truly embarrassing mistake (like reporting results from an improper 
posterior). As such, we now present a summary of a few of the methods 
proposed, along with their goals, advantages, and limitations. Cowles and 
Carlin (1996) and Mengersen et al. (1999) present much more complete re¬ 
views, describing the methodological basis and practical implementation of 
many such diagnostics that have appeared in the recent literature. Brooks 
and Roberts (1998) offer a similar review focusing on the underlying math¬ 
ematics of the various approaches, rather than their implementation. 

Convergence diagnostics 

MCMC convergence diagnostics have a variety of characteristics, which we 
now list along with a brief explanation. 

1. Theoretical basis. Convergence diagnostics have been derived using a 
broad array of mathematical machinery and sophistication, from simple 
cusum plots to kernel density estimates to advanced probability theory. 
Less sophisticated methods are often more generally applicable and eas¬ 
ier to program, but may also be more misleading in complex modeling 
scenarios. 

2. Diagnostic goal. Most diagnostics address the issue of bias, or the dis¬ 
tance of the estimated quantities of interest at a particular iteration 
from their true values under the target distribution. But a few also con¬ 
sider variance, or the precision of those estimates, which is perhaps more 
important from a practical standpoint. 

3. Output format. Some diagnostics are quantitative, producing a single 
number summary, while others are qualitative, summarized by a graph 
or other display. The former type is easier to interpret (and perhaps even 
automate), but the latter can often convey additional, otherwise hidden 
information to the experienced user. 
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4. Replication. Some diagnostics may be implemented using only a single 
MCMC chain, while others require a (typically small) number of parallel 
chains. (This is one of the more controversial points in this area, as we 
shall illustrate below.) 

5. Dimensionality. Some diagnostics consider only univariate summaries of 
the posterior, while others attempt to diagnose convergence of the full 
joint posterior distribution. Strictly speaking, the latter type is prefer¬ 
able, since posterior summaries of an apparently converged parameter 
may be altered by slow convergence of other parameters in the model. 
Unfortunately, joint posterior diagnostics are notoriously difficult and 
likely infeasible in high-dimensional problems, leading us to substitute 
univariate diagnostics on a representative collection of parameters. 

6. Applicability. Some diagnostics apply only to output from Gibbs sam¬ 
plers, some to output from any MCMC scheme, and some to a subset 
of algorithms somewhere between these two extremes. Again, there is 
often a tradeoff between broad applicability and diagnostic power or 
interpretability. 

7. Ease of use. Finally, generic computer code is freely available to imple¬ 
ment some convergence diagnostics. Generic code may be written by the 
user for others, and subsequently applied to the output from any MCMC 
sampler. Other diagnostics require that problem-specific code be writ¬ 
ten, while still others require advance analytical work in addition to 
problem-specific code. Not surprisingly, the diagnostics that come with 
their own generic code are the most popular among practitioners. 

We now consider several diagnostics, highlighting where they lie with 
respect to the above characteristics. Perhaps the single most popular ap¬ 
proach is due to Gelman and Rubin (1992). Here, we run a small number 
( m ) of parallel chains with different starting points, as in Figures 5.6, 5.7, 
and 5.8. These chains must be initially overdispersed with respect to the 
true posterior; the authors recommend use of a preliminary mode-finding 
algorithm to assist in this decision. Running the m chains for 2 N iterations 
each, we then attempt to check whether the variation within the chains for 
a given parameter of interest A approximately equals the total variation 
across the chains during the latter N iterations. Specifically, we monitor 
convergence by the estimated scale reduction factor. 



(5.46) 


where B/N is the variance between the means from the m parallel chains, 
W is the average of the m within-chain variances, and df is the degrees of 
freedom of an approximating t density to the posterior distribution. Equa¬ 
tion (5.46) is the factor by which the scale parameter of the t density might 
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shrink if sampling were continued indefinitely; the authors show it must ap¬ 
proach 1 as TV —► oo. (See Appendix Subsection C.3.3 for information on 
how to obtain an S implementation of this algorithm off the web at no 
charge.) 

Gelman and Rubin's approach can be derived using standard statisti¬ 
cal techniques, and is applicable to output from any MCMC algorithm. 
Thanks to the authors' S function, it is quite easy to use and produces 
a single summary number on an easily interpreted scale (i.e., values close 
to 1 suggest good convergence). However, it has been criticized on several 
counts. First, since it is a univariate diagnostic, it must be applied to each 
parameter or at least a representative subset of the parameters; the authors 
also suggest thinking of minus twice the log of the joint posterior density 
as an overall summary parameter to monitor. The approach focuses solely 
on the bias component of convergence, providing no information as to the 
accuracy of the resulting posterior estimates, i.e.. 


E{\ |y) = 


l 

mN 


m 2 N 

E E $>■ 


j -1 t-N+l 


Finally, the method relies heavily on the user's ability to find a starting 
distribution that is actually overdispersed with respect to the true posterior 
of A, a condition we can't really check without knowledge of the latter! 

Brooks and Gelman (1998) extend the Gelman and Rubin approach three 
important ways, describing an iterated graphical implementation of the 
original approach, broadening its definition to avoid the original's implicit 
normality assumption, and developing a multivariate generalization for si¬ 
multaneous convergence diagnosis for every parameter in a model (although 
the practical usefulness of this final extension is somewhat open to ques¬ 
tion). The diagnostics remain fairly easy to compute, though unfortunately 
there is no generally available S code available for its implementation. 

In a companion paper to the original Gelman and Rubin article, Geyer 
(1992) criticizes the authors' "initial overdispersion" assumption as unver- 
ifiable, and the discarding of early samples from several parallel chains as 
needlessly wasteful. He instead recommends a dramatically different ap¬ 
proach wherein we run a single chain, and focus not on the bias but on the 
variance of resulting estimates. For example, letting Ylt-i we 

have the central limit theorem 


s/n(X n — A) £ JV(0, <r 2 ), 

where rr 2 can be estimated by 

OO 

&N — ^ w N(t)jN,t- 
t—— oo 


© 2000 by CRC Press LLC 



Here, Wfj(t) is a weight function called a lag window, and-y/v.t is an estimate 
of the lag t autocovariance function, namely, 

i =1 

The lag window satisfies 0 < w^(t ) < 1, and is used to downweight the 
large-lag terms. An approximate 95% confidence interval for.E(A|y) is thus 
obtainable as 

Ajv ± 1 . (5.47) 

Note that simply using the standard deviation of the samples in place of 
&n in (5.47) would be anti conservative, since the sampling chain will likely 
feature positive autocorrelation, leading to an underestimate of cr. Note that 
the above formulae assume no discarding of early samples; Geyer argues 
that less than 1% of the run will normally be a sufficient burn-in period 
whenever the run is long enough to give much precision (recall Gelman and 
Rubin discard the first 50% of each chain). 

Geyer's approach relies on more sophisticated mathematics, addresses 
the variance goal, and avoids parallel chains or any notion of "initial overdis¬ 
persion." However, it doesn't generalize well beyond posterior moments 
(e.g., to posterior quantiles or density estimates for A), and does not con¬ 
cern itself with actually looking for bias, simply starting its sole chain at 
the MLE or some other “likely’’ place. Finally, it is less intuitive and a bit 
unwieldy, since proper choice of the lag window is not obvious. 

The diagnostic of Raftery and Lewis (1992) begins by retaining only every 
k th sample after burn-in, with k large enough that the retained samples are 
approximately independent. More specifically, their procedure supposes our 
goal is to estimate a posterior quantile q = P(\ < u |y) to within +r units 
with probability s. (That is, with q = .025, r = .005, and s = .95, our 
reported 95% credible sets would have true coverage between .94 and .96.) 
To accomplish this, they assume z{ k) = Z 1+ ( J _ 1 ) jfc is a Markov chain, where 
Z t = I(AW < u), and use results from Markov chain convergence theory. 

The Raftery and Lewis approach addresses both the bias and variance 
diagnostic goals, is broadly applicable, and is relatively easy to use thanks 
to computer code provided by the authors (see Appendix Subsection C.3.3 
for information on how to obtain it). However, Raftery and Lewis' formu¬ 
lation in terms of a specific posterior quantile leads to the rather odd result 
that the time until convergence for a given parameter differs for different 
quantiles. Moreover, systematic subsampling typically inflates the variance 
of the resulting estimates, as pointed out in Subsection 5.4.5. 

New MCMC convergence diagnostics continue to appear; see e.g. the 
variety of clever ideas in Robert (1995). Another notable recent addition 
to the literature is that of Johnson (1998), who derives a scheme based 
on the times at which the chain “couples” with an auxiliary chain that 
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is occasionally restarted from a fixed point in the parameter space. The 
approach uses the idea of regeneration times, points in the algorithm that 
divide the Markov chain into sections whose sample paths are independent; 
see Mykland, Tierney and Yu (1995). 

Summary 

Cowles and Carlin (1996) compare the performance of several convergence 
diagnostics in two relatively simple models, and conclude that while some 
are often successful, all can also fail to detect the sorts of convergence 
failures they were designed to identify. As a result, for a generic application 
of an MCMC method, they recommend a variety of diagnostic tools, rather 
than any single plot or statistic. Here, then, is a possible diagnostic strategy: 

• Run a few (3 to 5) parallel chains, with starting points drawn (perhaps 
systematically, rather than at random) from a distribution believed to be 
overdispersed with respect to the stationary distribution (say, covering 
±3 prior standard deviations from the prior mean). 

• Visually inspect these chains by overlaying their sampled values on a 
common graph (as in Figures 5.6 , 5.7, and 5.8 for each parameter, or, for 
very high-dimensional models, a representative subset of the parameters. 
(For a standard hierarchical model, such a subset might include most 
of the fixed effects, some of the variance components, and a few well- 
chosen random effects - say, corresponding to two individuals who are 
at opposite extremes relative to the population.) 

• Annotate each graph with the Gelman and Rubin (1992) statistic and lag 
1 autocorrelations, since they are easily calculated and the latter helps 
to interpret the former (large G&R statistics may arise from either slow 
mixing or multimodality). 

• Investigate crosscorrelations among parameters suspected of being con¬ 
founded, just as one might do regarding collinearity in linear regression. 
The use of multiple algorithms with a particular model is often helpful, 

since each will have its own convergence properties and may reveal different 
features of the posterior surface. Indeed, for low-dimensional problems, 
recent adaptive versions of traditional quadrature and importance sampling 
methods often serve as good checks and may even be more efficient; see 
e.g. Evans and Swartz (1995) in this regard. Considering multiple models 
is also generally a good idea, since besides deepening our understanding 
of the data this improves our chances of detecting convergence failure or 
hidden problems with model identifiability. 

Note that convergence acceleration is a task closely related to, but more 
tractable than, convergence diagnosis. In many problems, the time spent in 
a dubious convergence diagnosis effort might well be better spent search- 
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ing for ways to accelerate the MCMC algorithm, since if more nearly un¬ 
correlated samples can be obtained, the diagnosis problem may vanish. 
We thus remind the reader that reparametrizations can often improve a 
model's correlation structure, and hence speed convergence: see Hills and 
Smith (1992) for a general discussion and Gelfand et al. (1995, 1996) for 
"hierarchical centering" approaches for random effects models. The more 
sophisticated MCMC algorithms presented in Subsection 5.4.4 (mixtures 
and cycles, overrelaxation, structured MCMC, slice sampling, etc.) also 
often lead to substantial reductions in runtime. 

Finally, since automated convergence monitoring, even using a variety 
of diagnostics, is often unsafe, we instead recommend a two-stage pro¬ 
cess, wherein model specification and associated sampling are separated 
from convergence diagnosis and subsequent output analysis. In this regard, 
Best, Cowles, and Vines (1995) have developed a collection of 5+ rou¬ 
tines called CODA for the second stage of this process, with the first stage 
being accomplished by BUGS, the recently-developed software package for 
Bayesian analysis using Gibbs sampling (Gilks et al., 1992). Appendix Sub¬ 
section C.3.3 provides a discussion of these programs, guidance on how to 
use them, and instructions on how to obtain them. 


5.5 Exercises 

1. Use equation (5.1) to obtain a first-order approximation to the Bayes 
factor BF, given in equation (2.20). 1s there any computational advan¬ 
tage in the case where the two models are nested, i.e., 9\ = (& 2 ,~y)? 

2. Suppose that f{y\9) — 11"= i fiViW)- Describe how equation (5.5) could 
be used to investigate the effect of single case deletion (i.e., removing 'J'i 
from the dataset, for i — 1, . . ., n) on the posterior expectation of g(9). 

3. Show that the probability of accepting a given 9j candidate in the rejec¬ 
tion sampling is c/M , where c = / L(9)ir(0)dO, the normalizing constant 
for the posterior distribution h( 6 ). 

4. Suppose we have a sample {9 \,..., On} from a posterior pi(#|y), and 
we want to see the effect of changing the prior from m (9) to 7T-2(t?). 
Describe how the weighted bootstrap procedure could be helpful in this 
regard. Are there any restrictions on the shape of the new prior 7r 2 for 
this method to work well? 

5. Use quadratic regression to analyze the data in Table 5.6, which relate 
property taxes and age for a random sample of 19 homes. The response 
variable Y is property taxes assessed on each home (in dollars), while 
the explanatory variable X is age of the home (in years). Assume a 
normal likelihood and the noninformative prior ir((3, a) = 1/tr, where 
ft' - (00,01,132) and the model is Y{ = /? 0 + 0iX, + foX? + (, ■ Find the 
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Y 

X 

y 

X 

925 

1 

480 

20 

870 

2 

486 

22 

809 

4 

462 

25 

720 

4 

441 

25 

694 

5 

426 

30 

630 

8 

368 

35 

626 

10 

350 

40 

562 

10 

348 

50 

546 

12 

322 

50 

523 

15 




Table 5.6 County property tax (Y) and age of home (X) 


first two moments (hence a point and interval estimate) for the intercept 
00 assuming: 


(a) standard frequentist methods, 

(b) asymptotic posterior normality (first order approximation), and 

(c) Laplace's method (second order approximation). 


6 . In the previous problem, suppose that instead of the intercept 0o , the 
parameter of interest is the age corresponding to minimum tax under 
this model. Describe how you would modify the above analysis in this 
case. 

7. Actually perform the point and interval estimation of the age corre¬ 
sponding to minimum property tax in the previous problem using a 
Monte Carlo, instead of an asymptotic, approach. That is, generate 
values from the joint posterior distribution, and then, using the formula 
for the minimum tax age in terms of the 0s, obtain Monte Carlo values 
from its posterior distribution. 

( Hint: <r 2 |y ~ ^§ 5 ), where IG denotes the inverse gamma dis¬ 

tribution,/? = 3, the number of regressors, and RSS = ||y — X/3|| 2 , the 
residual sum of squares.) 

Do you think the quadratic model is appropriate for these data? Why 
or why not? 

8 . Consider the following hierarchical changepoint model for the number 
of occurrences Y t of some event during time interval is 

y ( Poisson(0), i = 1,..., k 

1 \ Poisson(A), i = k + 1,..., n 
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Year Count Year Count Year Count Year Count 


1851 4 

1852 5 

1853 4 

1854 1 

1855 0 

1856 4 

1857 3 

1858 4 

1859 0 

1860 6 

1861 3 

1862 3 

1863 4 

1864 0 

1865 2 

1866 6 

1867 3 

1868 3 

1869 5 

1870 4 

1871 5 

1872 3 

1873 1 

1874 4 

1875 4 

1876 1 

1877 5 

1878 5 


1879 3 

1880 4 

1881 2 

1882 5 

1883 2 

1884 2 

1885 3 

1886 4 

1887 2 

1888 1 

1889 3 

1890 2 

1891 2 

1892 1 

1893 1 

1894 1 

1895 1 

1896 3 

1897 0 

1898 0 

1899 1 

1900 0 

1901 1 

1902 1 

1903 0 

1904 0 

1905 3 

1906 1 


1907 0 

1908 3 

1909 2 

1910 2 

1911 0 

1912 1 

1913 1 

1914 1 

1915 0 

1916 1 

1917 0 

1918 1 

1919 0 

1920 0 

1921 0 

1922 2 

1923 1 

1924 0 

1925 0 

1926 0 

1927 1 

1928 1 

1929 0 

1930 2 

1931 3 

1932 3 

1933 1 

1934 1 


1935 2 

1936 1 

1937 1 

1938 1 

1939 1 

1940 2 

1941 4 

1942 2 

1943 0 

1944 0 

1945 0 

1946 1 

1947 4 

1948 0 

1949 0 

1950 0 

1951 1 

1952 0 

1953 0 

1954 0 

1955 0 

1956 0 

1957 1 

1958 0 

1959 0 

1960 1 

1961 0 

1962 1 


Table 5.7 Number of British coal mining disasters by year 


9 ~ ( 7 ( 01 , 61 ), A ~ G(a 2 ,b 2 ), 9, and A independent; 

61 ~ IG(ci , di), b 2 IG(c 2 ,d 2 ), 61 and b 2 independent, 
where G denotes the gamma and IG the inverse gamma distributions. 
Apply this model to the data in Table 5.7, which gives counts of coal 
mining disasters in Great Britain by year from 1851 to 1962. (Here, 
"disaster" is defined as an accident resulting in the deaths of 10 or more 
miners.) Set a x = a 2 = .5, c x — c 2 — 1, and d\ = d 2 = 1 ( a collection of 
"moderately informative" values). Also assume k = 40 (corresponding 
to the year 1890), and obtain marginal posterior density estimates for 
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9, A, and R = 9 /A using formula (5.16) with output from the Gibbs 
sampler. 

9. In the previous problem, assume k is unknown, and adopt the following 
prior: 

k - Discrete Uniform) 1, ...,«), independent of 0 and A 

Add k into the sampling chain, and obtain a marginal posterior density 
estimate for it. What is the effect on the posterior for R ? 

10. In the previous problem, replace the third-stage prior given above with 

&i ~ Gfci, d\), b'j ~ G(c 2 , d'j), bi and b 2 independent, 

thus destroying the conjugacy for these two complete conditionals. Re¬ 
sort to rejection or Metropolis-Hastings subsampling for these two com¬ 
ponents instead, choosing appropriate values for ci, c 2 , d\. and dn. What 
is the effect on the posterior for b\ and b 2 ! For R? For k? 

11. Reanalyze the flour beetle mortality data in Table 5.4, replacing the 
multivariate Metropolis algorithm used in Example 5.7 with 

(a) a Flastings algorithm employing independence chains drawn from a 
N(8 , E) candidate density, where 8 = (1.8, -4.0, -1.0)' (roughly the 
true posterior mode) and E as given in (5.27). 

(b) a univariate Metropolis (Metropolis within Gibbs) algorithm using 
proposal densities N(0 ( ] t ~ 1 \ Du), i = 1, 2, 3, with D as given in equa¬ 
tion (5.26). 

(c) a cycle or mixture of these two algorithms, as suggested in Exam¬ 
ple 5.8. 

Based on your results, what are the advantages and disadvantages of 
each approach? 

12. Returning again to the flour beetle mortality data and model of Ex¬ 
ample 5.7, note that the decision to use E = 2E in equation (5.27) 
was rather arbitrary. That is, univariate Metropolis "folklore" suggests 
a proposal density having variance roughly twice that of the true target 
should perform well, creating bigger jumps around the parameter space 
while still mimicking the target's correlation structure. But the opti¬ 
mal amount of variance inflation might well depend on the dimension 
and precise nature of the target distribution, the type of sampler used 
(multivariate versus univariate, Metropolis versus Hastings, etc.), or any 
number of other factors. 

Explore these issues in the context of the flour beetle mortality data in 
Table 5.4 by resetting E = cE for c = 1 (candidate variance matched to 
the target) and c = 4 (candidate standard deviations twice those in the 
target) using 
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(a) the multivariate Hastings algorithm in part (a) of problem 11. 

(b) the univariate Metropolis algorithm in part (b) of problem 11. 

(c) the multivariate Metropolis algorithm originally used in Example 5.7. 

Evaluate and compare performance using acceptance rates and lag 1 
sample autocorrelations. Do your results offer any additions (or correc¬ 
tions) to the "folklore?" 

13. Spiegelhalter et al. (1995b) analyze the flour beetle mortality data in 
Table 5.4 using the BUGS language. These authors use only the usual. 


two-parameter parametrization for p i = 
different link functions. 

P(death\wi), but compare 

logit: 

Pi = 

exp (a + 0Zi ) 

1 + exp (a + 0Zi) 

probit: 

Pi = 

$(a +0Zi) , 

complementary log-log: 

Pi = 

1 - exp[— exp (a + 0Zi)\ , 


where we use the centered covariate 2 - = w t — w hi order to reduce 
correlation between the intercept a and the slope ft. Using the Windows- 
based version of this package, WinBUGS , the code to implement this model 
is as follows: 

model 

for (i in 1:k){ 

y [i] _ dbin(pCi] , n [i]) 

logit(p[i]) <- alpha + beta*(w[i]-mean(w[])) 

# probit(p[i]) <- alpha + beta*(w[i]-mean(w[])) 

# cloglog(p[i]) <- alpha + beta*(w[i]-mean(w[])) 

} # end of i loop 

alpha " dnorm(0.0, 1.0E-3) 
beta ' dnorm(0.0, 1.0E-3) 

} # end of program 

Anything after a "#" is a comment in WinBUGS, so this version corre¬ 
sponds to the logit model. Also, "dbin" denotes the binomial distribution 
and "dnorm" the normal distribution, where in the latter case the sec¬ 
ond argument gives the normal precision, not variance (thus the normal 
priors for a and /? have precision 0.001, approximating flat priors). 

(a) The full conditional distributions for a and ;j have no closed form, but 
WinBUGS does recognize them as being log-concave, and thus capable 
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of being sampled using the Gilks and Wild (1992) adaptive rejection 
algorithm. Prove this log-concavity under the logit model. 

(b) Actually carry out the data analysis in WinBUGS , after download¬ 
ing theprogram from http://www.mrc-bsu.cam.ac.uk/bugs/ .(The 
program above, along with properly formatted data and initial value 
lists, are included in the "examples" section of the help materials.) 
Do the estimated posteriors for the dosage effect j3 substantially differ 
for different link functions? 

14. Consider again the model of Example 5.6. Suppose that instead of a 
straight-line growth curve, we wish to investigate the exponential growth 
model, 



with the remaining priors and hyperpriors as previously specified. 

(a) Of the full conditional distributions for {oj}, {/?,}, a 0) A)> and <r 2 , 
which now require modification? 

(b) Of those distributions in your previous answer, which may still be 
derived in closed form as members of familiar families? Obtain ex¬ 
pressions for these distributions. 

(c) Of those full conditionals that lack a closed form, give a Metropolis 
or Hastings substep for generating the necessary MCMC samples. 

(d) Write a program applying your results to the data in Table 5.3, and 
obtain estimated posterior distributions for a o and 0q What does 
your fitted model suggest about the growth patterns of young rats? 

15. Consider the generalized Hastings algorithm which uses the following 
candidate density: 



lor Vj — uj jzi 
otherwise 


That is, the algorithm chooses (randomly or deterministically) an index 
i £ {1, . . ., K}, and then uses the full conditional distribution along the 
i th coordinate as the candidate density. Show that this algorithm has 
Hastings acceptance ratio (5.25) identically equal to 1 (as called for by 
the Gibbs algorithm), and hence that the Gibbs sampler is a special case 
of the Hastings algorithm. 

16. Show that Adler's overrelaxation method (5.28) leaves the desired distri¬ 
bution invariant. That is, show that if d\ a 11 is conditionally distributed 
as then so is 9^ g K 
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17. A random variable Z defined on (0, oo is said to have a D-distribution 
with parameters 6, f3 > 0 and k E 10 , 1 , 2 .... } if its density function is 
defined (up to a constant of proportionality) by 

p z (z)ocz s ~ 1 e-^{l-e- z ) k . (5.48) 

This density emerges in many Bayesian nonparametric problems (e.g., 
Damien, Laud, and Smith, 1995). We wish to generate observations from 
this density. 

(a) If Z's density is log-concave, we know we can generate the necessary 

samples using the method of Gilks and Wild (1992). Find a con¬ 
dition (or conditions) under which this density is guaranteed to be 
log-concave. 

(b) When Z's density is not log-concave, we can instead use an auxiliary 
variable approach (Walker, 1995), adding the new random variable 
U = (Ui,..., Uk ). The Ui are defined on (0, 1) and mutually inde¬ 
pendent given Z, such that the joint density function of Z and U is 
defined (up to a constant of proportionality) by 

k 

Pz,u(z,u) OC Z S ~ 1 e~ Pz J^[/(e-M)(Mi) • 

«=1 

Show that this joint density function pz u does indeed have marginal 
density function p z given in equation (5.48) above. 

(c) Find the full conditional distributions for Z and U, and describe a 
sampling algorithm to obtain the necessary Z samples. What special 
subroutines would be needed to implement your algorithm? 

18. (a) Consider two approaches for sampling from a N( 0, 1) target density, 

p(x) oc exp( —|x 2 ): 

• a standard Metropolis algorithm using a Gaussian proposal density, 

~ N(x^ t ~ 1 \cr 2 ), and 

• the simple slice sampler (5.37). 

Run single chains of 1000 iterations for each of these samplers, starting 
both chains at x- Vl 1 = 0. For which values of the Metropolis proposal 
variance a 2 does the slice sampler produce smaller lag 1 autocorrela¬ 
tions (i.e., faster convergence)? 

(b) Repeat the preceding investigation for a multivariate N 5 (0. / 5 ) tar¬ 
get density, p(x) oc exp(—^x'x), where x' = (*i,..* 5 ). We now 
compare 

• a univariate Metropolis algorithm, updating one component at a 
time, and 

• a product slice sampler (5.38), which introduces five auxiliary vari¬ 
ables («i,..., U 5 ). 
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Now which sampler performs better? Is the efficiency of slice sampling 
affected by the dimension increase? 

19. Using the transition matrices (5.41) and (5.42), check that (5.43) holds, 
i.e., that the true marginal density f x given in (5.40) is indeed a station¬ 
ary point of the Gibbs sampler in this problem. 

20. Consider the following two complete conditional distributions, originally 
analyzed by Casella and George (1992): 


f(x\y) ocye yx , 0 < x < B < oc 
f(y\x) oc xe ~ xy , 0 < y < B < yz 


(a) Obtain an estimate of the marginal distribution of X when B = 10 
using the Gibbs sampler. 

(b) Now suppose B = o o so that the complete conditional distributions 
are ordinary (untruncated) exponential distributions. Show analyti¬ 
cally that/ x (t) = \/t is a solution to the integral equation 


UX) -J[f 


fx\y(x\y)f y \t(y\t)dy 


fx(t)dt 


in this case. Would a Gibbs sampler converge to this solution? Why 
or why not? 


21. Consider the balanced, additive, one-way ANOVA model. 


Yij — n + Qi + eij , i — 1, ..., I, j = 1 . J, (5.49) 

where 7V(0,<x“), p £ ; )i, a t £ S K, and a; >0. We adopt a prior 

structure that is a product of independent conjugate priors, wherein p 

has a flat prior, on ~ N(0, <r 2 a ), and a] ~ IG(a,b). Assume that a 2 , a, 
and b are known. 


(a) Derive the full conditional distributions for /r, a,. and a 2 , necessary 
for implementing the Gibbs sampler in this problem. 

(b) What is meant by "convergence diagnosis?" Describe some tools you 
might use to assist in this regard. What might you do to improve a 
sampler suffering from "slow convergence?" 

(c) Suppose for simplicity that a 2 is also known. What conditions on the 
data or the priors might lead to slow convergence for /j, and the 
(Hint: What conditions weaken the identifiability of the parameters?) 

(d) Now let t]i = // + q,'. so that r/,; "centers" a,. Then we can consider two 
possible parametrizations: (1) ^. a , and (2) y-rj. Generate a sample 
of data from likelihood (5.49), assuming I = 5, J = 1, and u e = 
<r a = 1. Investigate the sample cross correlations and autocorrelations 
produced by Gibbs samplers operating on Parametrizations 1 and 2 
above. Which performs better? 
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(e) Rerun your program for the case where <j e = 1 and <j a = 10. Now 
which parametrization performs better? What does this suggest about 
the benefits of hierarchical centering reparametrizations? 

22. To further study the relationship between identifiability and MCMC 
convergence, consider again the two-parameter likelihood model 

Y ~ N(0 X +0 2 ,l), 

with prior distributions 0 1 X(ai.b'f) and 9n ~ N(a 2 , 62)1 ^1 and ^2 
independent. 

(a) Clearly 0 1 and 0 2 are individually identified only by the prior; the 
likelihood provides information only on // = 0\ + 0 2 . Still, the full 
conditional distributions p( 0 1 \ 02 ,y) and p( 02 \Oi,y) are available as 
normal distributions, thus defining a Gibbs sampler for this problem. 
Find these two distributions. 

(b) In this simple problem we can also obtain the marginal posterior 
distributions p(0\\y) and p(0 2 |y) inclosed form. Find these two dis¬ 
tributions. Do the data update the prior distributions for these pa¬ 
rameters? 

(c) Set <zi = a 2 = 50, bi = b 2 = 1000, and suppose we observe y = 0. Run 
the Gibbs sampler defined in part (a) for t = 100 iterations, starting 
each of your sampling chains near the prior mean (say, between 40 and 
60), and monitoring the progress of 0 1 ,0 2 , and // Does this algorithm 
"converge" in any sense? Estimate the posterior mean of p. Does your 
answer change using t = 1000 iterations? 

(d) Now keep the same values for a } and a 2 but set b\ = b 2 = 10. Again 
run 100 iterations using the same starting values as in part (b). What 
is the effect on convergence? Again repeat your analysis using 1000 
iterations; is your estimate of E(p\y) unchanged? 

(e) Summarize your findings, and make recommendations for running and 
monitoring convergence of samplers running on "partially unidenti¬ 
fied" and "nearly partially unidentified" parameter spaces. 
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CHAPTER 6 


Model criticism and selection 


To this point we have seen the basic elements of Bayes and empirical Bayes 
methodology, arguments on behalf of their use from several different philo¬ 
sophical standpoints, and an assortment of computational algorithms for 
carrying out the analysis. We have observed that the generality of the 
methodology coupled with the power of modern computing enables consid¬ 
eration of a wide variety of hierarchical models for a given dataset. Given 
all this, the most natural questions for the reader to ask might be: 

1. How can 1 tell if any of the assumptions 1 have made (e.g., the specific 
choice of prior distribution) is having an undue impact on my results? 

2. How can I tell if my model is providing adequate fit to the data? 

3. Which model (or models) should 1 ultimately choose for the final pre¬ 
sentation of my results? 

The first question concerns the robustness of the model, the second involves 
assessment of the model, and the third deals with selection of a model (or 
group of models). An enormous amount has been written on these three 
subjects over the last fifty or so years, since they are the same issues faced 
by classical applied statisticians. The Bayesian literature on these subjects 
is of course smaller, but still surprisingly large, especially given that truly 
applied Bayesian work is a relatively recent phenomenon. We group the 
three areas together here since they all involve criticism of a model that 
has already been fit to the data. 

Subsection 2.3.3 presented the fundamental tool of Bayesian model se¬ 
lection, the Bayes factor, while Section 2.4 outlined the basics of model 
assessment and improved estimation and prediction through model aver¬ 
aging. Armed with the computational techniques presented in Chapter 5, 
we now revisit and expand on these model building tools. While we cannot 
hope to review all of what has been done, in this chapter we will attempt to 
present the tools most useful for the applied Bayesian, along with sufficient 
exemplification so that the reader may employ the approaches indepen¬ 
dently. As much of the work in this area is problem-specific and/or fairly 
technical, in certain areas our discussion will be limited to a summary of 
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the technique and a reference to the relevant place in the literature for 
more detailed information. 

6.1 Bayesian robustness 

As we have already mentioned, a commonly voiced concern with Bayesian 
methods is their dependence on various aspects of the modeling process. 
Possible sources of uncertainty include the prior distribution, the precise 
form of the likelihood, and the number of levels in the hierarchical model. 
Of course, these are concerns for the frequentist as well. For example, the 
statistics literature is filled with methods for investigating the effect of 
case deletion, which is a special type of likelihood sensitivity. Still, the 
problem appears more acute for the Bayesian due to the appearance of the 
prior distribution (even though it may have been deliberately chosen to be 
noninformative). 

In the next subsection, we investigate the robustness of the conclusions 
of a Bayesian analysis by checking whether or not they remain essentially 
unchanged in the presence of perturbations in the prior, likelihood, or some 
other aspect of the model. Subsection 6.1.2 considers a reverse attack on 
the problem, wherein we attempt to characterize the class of model char¬ 
acteristics (say, prior distributions) that lead to a certain conclusion given 
the observed data. Both of these approaches differ slightly from what many 
authors call the "robust Bayesian viewpoint," which seeks model settings 
(particularly prior distributions) that are robust from the outset, rather 
than picking a (perhaps informative) baseline and checking it later, or cat¬ 
aloging assumptions based on a predetermined decision. The case studies 
in Chapter 8 feature a variety of robustness investigations, while further 
methodological tools are provided in Chapter 7. 

6.1.1 Sensitivity analysis 

The most basic tool for investigating model uncertainty is the sensitivity 
analysis. That is, we simply make reasonable modifications to the assump¬ 
tion in question, recompute the posterior quantities of interest, and see 
whether they have changed in a way that has practical impact on interpre¬ 
tations or decisions. If not, the data are strongly informative with respect 
to this assumption and we need not worry about it further. If so. then our 
results are sensitive to this assumption, and we may wish to communicate 
the sensitivity, think more carefully about it, collect more data, or all of 
these. 

Example 6.1 Suppose our likelihood features a mean parameter 9 and a 
variance parameter <j 2 and in our initial analysis we employed a r 2 ) 
prior on 9 and an IG(a, b) prior on erf To investigate possible sensitivity 
of our results to the prior for 6 we might first consider shifts in its mean. 
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e.g., by increasing and decreasing p by one prior standard deviation r. We 
also might alter its precision, say, by doubling and halving r. Alternatively, 
we could switch to a heavier-tailed t prior with the same moments. 

Sensitivity to the prior for a 2 could be investigated in the same way, 
although we would now need to solve for the hyperparameter values re¬ 
quired to increase and decrease its mean or double and halve its standard 
deviation. ■ 


The increased ease of computing posterior summaries via Monte Carlo 
methods has led to a corresponding increased ease in performing sensitivity 
analyses. For example, given a converged MCMC sampler for a posterior 
distribution p(#ly), it should take little time for the sampler to adjust 
to the new posterior distribution Pnew{® |y) arising from some modest 
change to the prior or likelihood. Thus, the additional sampling could be 
achieved with a substantial reduction in the burn-in period. Still, for a 
model having p hyperparameters, simply considering two alternatives (one 
larger, one smaller) to the baseline value for each results in a total of 3 P 
possible priors - possibly infeasible even using MCMC methods. 

Fortunately, a little algebraic work eliminates the need for further sam¬ 
pling. Suppose we have a sample 9\,.. .,9p) from a posterior p(9\y), which 
arises from a likelihood f(y\9) = Yil-i and a prior p(0). To study 

the impact of deleting case k, we see that the new posterior is 

p ™ (%)<xZ 7Sf“7sb) , ’ ( ' ,|y) - 


In the notation of Subsection 5.3.2, we can use g{0) = p(#|y) as an impor¬ 
tance sampling density, so that the weight function is given by 


pnew{9 |y) 

P(%) ~~ f{yk\0) 


Thus for any posterior function of interest h(9), we have E[h(9) |y] = 

Ylj=iK6j)l N and 


E N Ew[h(9)\y] ~ ——jy-—— 


( 6 . 1 ) 


using equation (5.8). Since p and pNEW should be reasonably similar, 
the former should be a good importance sampling density for the latter 
and hence the approximation in equation (6.1) should be good for mod¬ 
erate N. We could also obtain an approximate sample from pNEW via 
the weighted bootstrap by resamplina 9i values with probability q t: where 
qi = w ( 0 j)- 

In situations where large changes in the prior are investigated, or where 
outlying or influential cases are deleted, pnew may differ substantially from 
p ■ A substantial difference between the two posteriors can lead to highly 
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variable importance sampling weights and, hence, poor estimation. Impor¬ 
tance link function transformations of the Monte Carlo sample (MacEach- 
ern and Peruggia, 2000b) can be used to effectively alter the distribution 
from which the 0\ ,..., On ) were drawn, stablizing the importance sam¬ 
pling weights and improving estimation. 


Sensitivity analysis via asymptotic approximation 


Notice that importance sampling may be used for sensitivity analysis even 
if the original posterior sample { 0 X; ..., $ N } was obtained using some other 
method (e.g., the Gibbs sampler). The new posterior estimates in (6.1) are 
not likely to be as accurate as the original, but are probably sufficient for the 
purpose of a sensitivity analysis. In the same vein, the asymptotic methods 
of Section 5.2 can be used to obtain approximate sensitivity results without 
any further sampling or summation. For example, we have already seen 
in equation (5.5) a formula requiring only two maximizations (both with 
respect to the original posterior) that enables any number of sensitivity 
investigations. In the realm of model comparison. Kass and Vaidyanathan 
(1992) use such approximations to advantage in determining the sensitivity 
of Bayes factors to prior and likelihood input. Let it.\EW.k(^k ) be new priors 
on 0^ where k = 1,2 indexes the model. Writing BF for the Bayes factor 
under the original priors and BFy/EW for the Bayes factor under the new 
priors, Kass and Vaidyanathan (1992, equation (2.23)) show that 


BF new = BF • 

r2(0 2 ) 



( 6 - 2 ) 


where r k {0k) — ft N EW,k (0 k) / ftk(0 k) and Ok is the posterior mode using the 
original prior Wk(0k)- Thus, one only needs to evaluate r k at Q k for k= 1,2 
for each new pair of priors, and these computations are sufficiently easy 
and rapid that a very large number of new priors can be examined without 
much difficulty. 

A further simplification comes from a decomposition of the parameter 
vector into two components, 0 k = (0^\of^) with the first component 
containing all parameters that are common to both models. We define 
0 = 0J 1 "* (= 0^). This parameter vector might be considered a nuisance 
parameter while the 0' k 2 > k = 1,2 become the parameter vectors of interest. 
If we take the priors to be such that the two components are a priori inde¬ 
pendent under both models, we may write TTk(Ok) = ttJ . 1 1 *) • ft^\0^). 
Suppose this is the case, and furthermore 7 ^ l \0^) 

— ^2 and that 

both these conditions hold for the new priors considered. In this situation, if 
it turns out that the first-component posterior modes under the two models 
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are approximately equal (formally, to order 0(n *)), then 

ft NEW, 1 (^1) __ ^EW 

knew, 2 (^ 2 ) k ( j ^ EW2 (0 ( 2 2) ) 

to the same order of accuracy as equation (6.2). The ratio ri( 6^1 )/r 2 ( 6 ^ 2 ) in 
(6.2) thus involves the new priors only through their second components. In 
other words, if the modal components 0^ are nearly equal for k = 1 and 2 , 
when we perform the sensitivity analysis we do not have to worry about the 
effect of modest modifications to the prior on the nuisance-parameter com¬ 
ponent of 6k. Carlin, Kass, Lerch, and Huguenard (1992) use this approach 
for investigating sensitivity of a comparison of two competing models of 
human working memory load to outliers and changes in the prior distribu¬ 
tion. 

Sensitivity analysis via scale mixtures of normals 

The conditioning feature of MCMC computational methods enables an¬ 
other approach to investigating the sensitivity of distributional specifica¬ 
tions in either the likelihood or prior for a broad class of common hierar¬ 
chical models. Consider the model y x = p, + e,. i = 1 , . . n , where the IH 
are unknown mean structures and the a are independent random errors 
having density f with mean 0. For convenience, one often assumes that 
the a form a series of independent normal errors, i.e., c, \a' 2 ~ /V(0, a 2 ), 
i = 1However, the normal distribution's light tails may well make 
this an unrealistic assumption, and so we wish to investigate alternative 
forms that allow greater variability in the observed y i values. Andrews and 
Mallows (1974) show that expanding our model to 

e*|<r 2 , A, ~ 7V(0, A;(r 2 ), i — 1 ,..., n , 

and subsequently placing a prior on A, enables a variety of familiar (and 
more widely dispersed) error densities to emerge. That is, we create f as 
the scale mixture of normal distributions, 

/(e;|<r 2 ) = / p(£i\(T 2 , \i)p(\i)d\i , i = 1,.. ,,n . 

J A 

The following list identifies the necessary functional forms for p(A;) to ob¬ 
tain some of the possible departures from normality: 

• Student's t errors: If r>/A,; ~ (i e., if A; ~ IG( p)) then e t \ti ~ 
f„( 0 ,cr). 

• Double exponential errors: If A, : ~ E( 2), the exponential distribution 
having mean 2, then e,|cr ~ DE(0, a), where DE denotes the double 
exponential distribution. 
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• Logistic errors: If 1 /\/\ has the asymptotic Kolmogorov distance dis¬ 
tribution, then a\a is logistic (see Andrews and Mallows, 1974). 

Generally one would not view the addition of n parameters to the model as 
a simplifying device, but would prefer instead to work directly with the non¬ 
normal error density in question. However, Carlin and Poison (1991) point 
out that the conditioning feature of the Gibbs sampler makes this augmen¬ 
tation of the parameter space quite natural. To see this, note first that 
under an independence prior on the Aj, we have: A, |{ A,^,}. {pj }. a 1 , y ~ 
\i\Hi,a 2 ,yi. But by Bayes theorem, p(A;|/q, cr 2 , j/i) oc | , cr 2 , A,-)p(A,-) , 
where the appropriate normalization constant, p(yi\pi, a 2 ), is known by 
construction as the desired nonnormal error density. Hence the complete 
conditional for A,; will always be of known functional form. Generation of 
the required samples may be done directly if this form is a standard density; 
otherwise, a carefully selected rejection method may be employed. Finally, 
since the remaining complete conditionals (for cr 2 and the pj are deter¬ 
mined given A, convenient prior specifications may often be employed with 
the normal likelihood, again leading to direct sampling. 


Example 6.2 Consider the model y,- = p, + , where j indexes the 

model error distribution (i.e., the mixing distribution for the \ t ), and i 
indexes the observation, as before. Suppose that p i = /(x;,#)/3, where 
/3 is a ^-dimensional vector of linear nuisance parameters, and f (x,, 6) = 
(/i(xj, 6), ..., fk(*i, 0 )) is a collection of known functions, possibly non¬ 
linear in 9. Creating the n x k matrix 7) = (f(x, .6)). the log-likelihood 
logp(y | 6, (3, A, a 2 ) is 

1 1 " 

~2^(y - - F «P) - «log(T - - ^logAi , 


where E, = Diag( Ai,...,A n ). Assume that (3 ~ N((3 n ,Y,o), where /3 0 
and So are known. In addition, let cr 2 ~ IG(ao, bo), and let 0 have prior 
distribution p(9). 

Suppose that, due to uncertainty about the error density and the impact 
of possible outliers, we wish to compare the three models 

M= 1: 0,<t 2 ), 

M — 2 : a ~ t( 0, a 2 , v = 2) , and 
M = 3 : ~ DE( 0, cr) , 

where M indicates which error distribution (model) we have selected. For 
M — 1, clearly the A, are not needed, so their full conditional distributions 
are degenerate at 1. For M — 2, we have the full conditional distribution 


IG 


fv+1 

{ 2 



' {Vi - f(x,-,6l)/3) 2 





i n , 
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in a manner very similar to that of Andrews and Mallows (1974). Finally, 
for M = 3 we have a complete conditional proportional to 


A,-' 


exp 




that is, X t \9, f3, a 2 ,y ~ GIG (|, 1, (y,- - f(xi,9)(3) 2 /a 2 ) for / = 1, . . ., n, 
where GIG denotes the generalized inverse Gaussian distribution (see De- 
vroye, 1986, p. 478). In order to sample from this density, we note that it 
is the reciprocal of an 


Inverse Gaussian 


yi - f(xi , 9)0 


a density from which we may easily sample (see for example Devroye, 1986, 
p. 149). ■ 

Thus, the scale mixture of normals approach enables investigation of 
nonnormal error distributions as a component of the original model, rather 
than later as part of a sensitivity analysis, and with little additional com¬ 
putational complexity. If the A,; all have posterior distributions centered 
tightly about 1, the additional modeling flexibility is unnecessary, and the 
assumption of normality is appropriate. In this same vein, the A,; can be 
thought of as outlier diagnostics, since extreme observations will correspond 
to extreme fitted values of these scale parameters. Wakefield et al. (1994) 
use this idea to detect outlying individuals in a longitudinal study by em¬ 
ploying the normal scale mixture at the second stage of their hierarchical 
model (i.e., on the prior for the random effect component of /i 8 ). 


6.1.2 Prior Partitioning 

The method of sensitivity analysis described in the previous subsection is 
a direct and conceptually simple way of measuring the effect of the prior 
distributions and other assumptions made in a Bayes or empirical Bayes 
analysis. However, it does not free us from careful development of the orig¬ 
inal prior, which must still be regarded as a reasonable baseline. This can 
be impractical, since the prior beliefs and vested interests of the potential 
consumers of our analysis may be unimaginably broad. For example, if we 
are analyzing the outcome of a government-sponsored clinical trial to de¬ 
termine the effectiveness of a new drug relative to a standard treatment 
for a given disease, our results will likely be read by doctors in clinical 
practice, epidemiologists, regulatory workers (e.g., from the U.S. Food and 
Drug Administration), legislators, and, of course, patients suffering from 
or at risk of contracting the disease itself. These groups are likely to have 
widely divergent opinions as to what constitutes "reasonable" prior opin¬ 
ion: the clinician who developed the drug is likely to be optimistic about 
its value, while the regulatory worker (who has seen many similar drugs 
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emerge as ineffective) may be more skeptical. What we need is a method 
for communicating the robustness of our conclusions to my prior input the 
reader deems appropriate. 

A potential solution to this problem is to "work the problem backward," 
as follows. Suppose that, rather than fix the prior and compute the pos¬ 
terior distribution, we fix the posterior (or set of posteriors) that produce 
a given conclusion, and determine which prior inputs are consistent with 
this desired result, given the observed data. The reader would then be free 
to determine whether the outcome was reasonable according to whether 
the prior class that produced it was consistent with his or her own prior 
beliefs. We refer to this approach simply as prior partitioning since we are 
subdividing the prior class based on possible outcomes, though it is im¬ 
portant to remember that such partitions also depend on the data and the 
decision to be reached. As such, the approach is not strictly Bayesian (the 
data are playing a role in determining the prior), but it does provide valu¬ 
able robustness information while retaining the framework's philosophical, 
structural, documentary, and communicational advantages. 

To illustrate the basic idea, consider the point null testing scenario Hq : 
9 — do versus Hi : 9 ^ 9q. Without loss of generality, set 9 0 = 0. Suppose 
our data x has density /(#|d), where 9 is an unknown scalar parameter. 
Let tt represent the prior probability of Hq, and G(9) the prior cumulative 
distribution function (cdf) of 9 conditional on {0 / 0}. The complete prior 
cdf for 9 is then F(9) = trl[ 0oo )(9) + (1 - tt)G(9), where Is is the indicator 
function of the set S. Hence the posterior probability of the null hypothesis 
is 


P g (9 = Ok) = 


7r /( x |0) 


(6.3) 


Ttf(x |0) + (1 - It) f f(x\9)dG(6) 

Prior partitioning seeks to characterize the G for which this probability is 
less than or equal to some small probability p £ (0, 1), in which case we 
reject the null hypothesis. (Similarly, we could also seek the G leading to 
Pq(9 k Ok) < p, in which we would reject H i. Elementary calculations 
show that characterizing this class of priors {G} is equivalent to character¬ 
izing the set 7i c , defined as 


n c = |g : J f(x\9)dG(6) > c = 0 ) j . (6.4) 


Carlin and Louis (1996) establish results regarding the features of 7i c , and 
then use these results to obtain sufficient conditions for 7i c to be nonempty 
for classes of priors that satisfy various moment and percentile restrictions. 
The latter are somewhat more useful, since percentiles and tail areas of the 
conditional prior G are transform-equivariant, and Chaloner et al. (1993) 
have found that elicitees are most comfortable describing their opinions 
through a "best guess" (mean, median, or mode) and a few relatively ex¬ 
treme percentiles (say, the 5 th and the 95 <ft ). 


© 2000 by CRC Press LLC 



To lay out the technical detail of this percentile restriction approach, let 
9 l and 9jj be such that P a (9 <9 L ) = a L and P G (0 > Ou) = ay, where a L 
and a G lie in the unit interval and sum to less than 1. For fixed values for 
0L and du we seek the region of a L , ay values (i.e., the class of priors) 
that lead to a given decision. To accomplish this we can take a general G, 
alter it to pass through 0 L ,a L and 6 V , l - ay), and then search over all 
G subject only to the constraint that the intervals (—00, ay], (ay, ay], and 
(ay, 00) all have positive probability. 

Assume that f(x\9) is a unimodal function of 0 for fixed x that vanishes 
in both tails, an assumption that will be at least approximately true for 
large datasets due to the asymptotic normality of the observed likelihood 
function. Keeping 0y , Or and a fixed, we seek the supremum and infimum 
of f f(x\9)dG(9). The infimum will always be given by 

(1 - a L ~ au) min{/(a;|6» L ), f(x\9u)} , (6.5) 

since a unimodal/must take its minimum over the central support interval 
at one of its edges. However, depending on the location of the maximum 
likelihood estimator 0, sup G f f(x\9)dG(9) equals 

a L f(x\0) + (1 - a L - a v )f(x\e L ) + ay/(a;|0y), 0 < 6 L 

a Lf(x\9L) + (l-a L -a u )f(x\0) + a u f(x\e u ), 9 L < 9 < 0 V . (6.6) 
ay/(z|0y) + (1 - ay - au)f(x\9u) + auf{x\9), 9 > 9 V 

Notice that the infimum is obtained by pushing mass as far away from 
the MLE as allowed by the constraints, while the supremum is obtained 
by pushing the mass as close to the MLE as possible. In conjunction with 
7 t, the prior probability of Ho, the supremum and infimum may be used 
to determine the prior percentiles compatible with P(Q ~ Q/j < p and 
P(9 ^ 0|ar) < p, respectively. Since Ti c is empty if the supremum does 
not exceed c, we can use the supremum expression to determine whether 
there are any G that satisfy the inequality in (6.4), i.e., whether any priors 
G exist that enable stopping to reject the null hypothesis. Similarly, the 
infimum expression may be useful in determining whether any G enable 
stopping to reject the alternative hypothesis, H\. 

We may view as fixed either the ( 9 L , 9 (J ) pair, the (a L , a v ) pair, or both. 
As an example of the first case, suppose we seek the (ai, ay) compatible 
with a fixed (0y,^y) pair (an indifference zone) for which J f{x\0)dG(9) > 
c. Then given the location of 9 with respect to the indifference zone, equa¬ 
tion (6.6) may be easily solved to obtain the half-plane in which accept¬ 
able as must lie. When combined with the necessary additional constraints 
a 1 > 0, ay > 0 , and a L -)- a,, < / the result is a polygonal region that is 
easy to graph and interpret. Graphs of acceptable ( 0l,@u ) pairs for fixed 
(ay, ay) may be obtained similarly, although the solution of equation (6.6) 
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is now more involved and the resulting regions may no longer be compact. 
We will return to these ideas in the context of our Section 8.2 data example. 

Since the ideas behind prior partitioning are somewhat more theoretical 
and less computational than those driving sensitivity analysis, the approach 
has been well developed in the literature. Two often-quoted early references 
are the paper by Edwards, Lindman, and Savage (1963), who explore robust 
Bayesian methods in the context of psychological models, and the book by 
Mosteller and Wallace (1964), who discuss bounds on the prior probabilities 
necessary to choose between two simple hypotheses (authorship of a given 
disputed Federalist paper by either Alexander Hamilton or James Madison). 

The subsequent literature in the area is vast; see Berger (1985, Section 
4.7) or, more recently, Berger (1994) for a comprehensive review. Here we 
mention only a few particularly important papers. In the point null setting, 
Berger and Sellke (1987) and Berger and Delampady (1987) show that we 
attain the minimum of P(9 = 0o\x) over all conditional priors G for 0 f 9 0 
when G places all of its mass at 9, the maximum likelihood estimate of 0. 
Even in this case, where G is working with the data against Ho these au¬ 
thors showed that the resulting P{6 = 0 o |a;) values are typically still larger 
than the corresponding two-sided p-value, suggesting that the standard 
frequentist approach is biased against Ho i n this case. In the interval null 
hypothesis setting, prior partitioning is reminiscent of the work of O'Hagan 
and Berger (1988), who obtain bounds on the posterior probability content 
of each of a collection of intervals that form the support of a univariate 
parameter, under the restriction that the prior probability assignment to 
these intervals is in a certain sense unimodal. In the specific context of clin¬ 
ical trial monitoring. Greenhouse and Wasserman (1994) compute bounds 
on posterior expectations and tail areas (stopping probabilities) over an 
G-contaminated class of prior distributions (Berger and Berliner, 1986). 


Further restricting the prior class 

Sargent and Carlin (1996) extend the above approach to the case of an 
interval null hypothesis, i.e., Hq '■ 0 £ [9l>^u] versus Hi : 9 (fc [9 l,9u]- 
This formulation is useful in the context of clinical trial monitoring, where 
[9 l ,9u] is thought of as an indifference zone, within which we are indifferent 
as to the use of treatment or placebo. For example, we might take Ojj > 0 if 
there were increased costs or toxicities associated with the treatment. Let 
7T again denote the prior probability of Ho, and let G{9 now correspond to 
the prior cdf of 9 given 6 f [0 L ,9u], Making the simplifying assumption of 
a uniform prior over the indifference zone, the complete prior density for 9 
may be written as 

P(9) = + (! - *W) ■ ( 6 - 7 ) 

ou ~ VL 
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Sargent and Carlin (1996) derive expressions similar to (6.3) and (6.4) un¬ 
der the percentile restrictions of the previous subsection. However, these 
rather weak restrictions lead to prior classes that, while plausible, are of¬ 
ten too broad for practical use. As such, we might consider a sequence of 
increasingly tight restrictions on the shape and smoothness of permissible 
priors, which in turn enable increasingly informative results. For example, 
we might retain the mixture form (6.7), but now restrict g(Q to some par¬ 
ticular parametric family. Carlin and Sargent (1996) refer to such a prior as 
"semiparametric" since the parametric form for g does not cover the indif¬ 
ference zone [8i, 9y ], although since we have adopted another parametric 
form over this range (the uniform) one might argue that "biparametric" or 
simply "mixture" would be better names. 

We leave it as an exercise to show that requiring Q £ ')-( c j s equivalent 
to requiring BF < (^), where 


BF = 


iuhn fe7 f(*\ 0 )dQ 

I f{x\e)g(6)d9 ’ 


( 6 . 8 ) 


the Bayes factor in favor of the null hypothesis. Equation (6.8) expresses 
the Bayes factor as the ratio of the marginal densities under the competing 
hypotheses; it is also expressible as the ratio of posterior to prior odds in 
favor of the null. As such, BF gives the extent to which the data have 
revised our prior beliefs concerning the two hypotheses. Note that if we 
take 7r = 1/2 (equal prior weighting of null and alternative) then a Bayes 
factor of 1 suggests equal posterior support for the two hypotheses. In this 
case, we require a Bayes factor of 1/19 or smaller to insure that p(ll (j \x) 
does not exceed 0.05. 

In practice, familiar models from the exponential family are often ap¬ 
propriate (either exactly or asymptotically) for the likelihood f)x\9). This 
naturally leads to consideration of the restricted class of conjugate priors 
9(9) ■ to obtain a closed form for the integral in the denominator of (6.8). 
Since a normal approximation to the likelihood for 6 is often suitable for 
even moderate sample sizes, we illustrate in the case of a conjugate normal 
prior. The fact that g is defined only on the complement of the indifference 
zone presents a slight complication, but, fortunately, the calculations re¬ 
main tractable under a renormalized prior with the proper support. That 
is, we take 


9(&) = 


N(9\n, r 2 ) 


1 - 


<J> 




0 t [6 l,0u] , 


where the numerator denotes the density of a normal distribution with 
mean p and variance r 2 , and 4* denotes the cdf of a standard normal 
distribution. 
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To obtain a computational form for equation (6.8), suppose we can ap¬ 
proximate the likelihood satisfactorily with a N(9\9, <f 2 ) density, where 8 
is the maximum likelihood estimate (MLE) of 8 and a 2 is a corresponding 
standard error estimate. Probability calculus then shows that 


f f(x\9)g(9)d8 


^/27t(ct 2 +T 2 ) 


exp 


2(<t 2 +t 2 ) 




(6.9) 


where r) = (<f 2 // + r 2 0)/(<r 2 + r 2 ) and l/ 2 = cr 2 r 2 /(<f 2 + r 2 ). (Note that 
rj and v 2 are respectively the posterior mean and variance under the fully 
parametric normal/normal model, described in the next subsection.) Since 

fjj u f(x\9)d9 — $ — we can now obtain the Bayes factor 

(6.8) without numerical integration, provided that there are subroutines 
available to evaluate the normal density and cdf. 

As a final approach, we might abandon the mixture prior form (6.7) in 
favor of a single parametric family h(0), preferably chosen as conjugate with 
the likelihood f(x\8). If such a choice is possible, we obtain simple closed 
form expressions for Bayes factors and tail probabilities whose sensitivity 
to changes in the prior parameters can be easily examined. For example, for 
our N(6\8, <f 2 ) likelihood under a N(6\p, r 2 ) prior, the posterior probability 
of Ho is nothing but 

P(8 € [0 L , 8u]\x) = <D - * (^ L2 ) - 


where r\ and v 2 are again as defined beneath equation (6.9). Posterior prob¬ 
abilities that correspond to stopping to reject the hypotheses Hl : 0 < 9i 
and Hu : 9 > 9u arise similarly. 


6.2 Model assessment 

We have already presented several tools for Bayesian model assessment in 
Subsection 2.4.1. It is not our intention to review all of these tools, but 
rather to point out how their computation is greatly facilitated by modern 
Monte Carlo computational methods. Since most Bayesian models encoun¬ 
tered in practice require some form of sampling to evaluate the posterior 
distributions of interest, this means that common model checks will be 
available at little extra cost, both in terms of programming and runtime. 
Consider for example the simple Bayesian residual, 

n = tji — E(yi |z), i - 1 ,... ,n, 

where z is the sample of data used to fit the model and y = (t/i, • • •, Vn)' 
is an independent validation sample. Clearly, calculation of r, requires an 
expectation with respect to the posterior predictive distribution p(yi |z), 
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which will rarely be available in closed form. However, notice that we can 
write 


E{yi\z) = J yip(yi\z)dyi 

= jj yif(yi\0)p(0\z)d0dyi 
= J E(yi\8)p(9\z)d0 


where the 0^ are samples from the posterior distribution p(0 |z) (for 
Markov chain Monte Carlo algorithms, we would of course include only 
post-convergence samples). The equality in the second line holds due to 
the conditional independence of y and z given 9, while that in the third 
line arises from reversing the order of integration. But E(yi\6) typically will 
be available in closed form, since this is nothing but the mean structure of 
the likelihood. The fourth line thus arises as a Monte Carlo integration. 

Even if E(yi\8) is not available in closed form, we can still estimate 
E(yi |z) provided we can draw samples t/j- g ' from f(y t \8^^). Such sampling 
is naturally appended onto the algorithm generating the 0^ samples them¬ 
selves. In this case we have 


E(yi |z) 


J J yif{yi\0)p{8\z)d9dyi 


9 =1 


since (yj s \8^) constitute a sample from p(yi,8 |z). This estimator will 
not be as accurate as the first (since both integrals are now being done via 
Monte Carlo), but will still be simulation consistent (i.e., converge to the 
true value with probability 1 as G —► oo). 

Next, consider the cross-validation residual given in (2.23), 


U = 2 a -E{yi\y(i)), i = 


where we recall that y(q denotes the vector of all the data except the i th 
value. Now we have 


E(yi\y(i)) = 


J E{y i \8)p(8\y {i) )d8 

j E{yi\8)p(9\y)dS 
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9=1 


where the 0^ are samples from the complete data posterior p(6\y). The 
approximation in the second line should be adequate unless the dataset 
is small and y, is an extreme outlier. It enables the use of the same B 1 ' 111 
samples (already produced by the Monte Carlo algorithm) for estimating 
each conditional mean E(m |y(;)), hence each residual n, for i= 1 

Finally, to obtain the cross-validation standardized residual in (2.24), 
we require not only £ , (j/i|y(j)) J but also Var( (/, |y (! -j). This quantity is es¬ 
timable by first rewriting V ar(yi |y (i) ) = E{yf\y (i] )-[E(yi\y (i) )] 2 , and then 
observing 

E[yf\y(i)) - J E(yf\0)p{d\y {i) )d9 

= j {Var{yi\6) + [E{y i \8)] 2 }p(d\y^ ) )d0 
1 G 
9 -1 


again approximating the reduced data posterior with the full data posterior 
and performing a Monte Carlo integration. 

Other posterior quantities from Subsection 2.4.1, such as conditional 
predictive ordinates /(j/,-|y u) ) and posterior predictive model checks (e.g., 
Bayesian p-values) can be derived similarly. The details of some of these 
calculations are left as exercises. 


6.3 Bayes factors via marginal density estimation 


As in the previous section, our goal in this section is not to review the 
methods for model choice and averaging that have already been presented 
in Subsections 2.3.3 and 2.4.2, respectively. Rather, we provide several com¬ 
putational approaches for obtaining these quantities, especially the Bayes 
factor. In Section 6.5 we will present some more advanced approaches that 
are not only implemented via MCMC methods, but also motivated by these 
methods, in the sense that they are applicable in very highly-parametrized 
model settings where the traditional model selection method offered by 
Bayes factors is unavailable or infeasible. 

Recall the use of the Bayes factor as a method for choosing between two 
competing models Mi and Mi. given in equation (2.20) as 


BF = pjyL M j .1 

p(y I m 2 ) ’ 


( 6 . 10 ) 


the ratio of the observed marginal densities for the two models. For large 
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sample sizes n, the necessary integrals over 0, may be available conve¬ 
niently and accurately via asymptotic approximation, producing estimated 
Bayes factors accurate to order 0{l/n), where n is the number of factors 
contributing to the likelihood function. In any case, equation (6.2) reveals 
that asymptotic approximations are helpful in discovering the sensitivity 
of Bayes factors to prior specification, whether they played a role in the 
original calculation or not. 

For moderate sample sizes n or reasonably challenging models, how¬ 
ever, such approximations are not appropriate, and sampling-based meth¬ 
ods must be used to obtain estimates of the marginal likelihoods needed to 
evaluate BF. This turns out to be a surprisingly difficult problem: unlike 
posterior and predictive distributions, marginal distributions are not easily 
estimated from the output of an MCMC algorithm. As a result, many dif¬ 
ferent approaches have been suggested in the literature, all of which involve 
augmenting or otherwise "tricking" a sampler into producing the required 
marginal density estimates. As with our Subsection 5.4.6 discussion of con¬ 
vergence diagnostics, we do not attempt a technical review of every method, 
but we do comment on their strengths and weaknesses so that the reader 
can judge which will likely be the most appropriate in a given problem 
setting. Kass and Raftery (1995) provide a comprehensive review of Bayes 
factors, including many of the computational methods described below. 


6.3.1 Direct methods 


In what follows, we suppress the dependence on the model indicator M in 
our notation, since all our calculations must be repeated for both models 
appearing in expression (6.10). Observe that since 

P(y) = j f(y I 9)p(9)d0 , 

we could generate observations {9 <a} ) G ]=1 from the prior and compute the 
estimate 

1 G 

p( y) = |0 U) ), (6 - n) 

0=1 

a simple Monte Carlo integration. Unfortunately, the conditional likelihood 
/(y ! 9) will typically be very peaked compared to the prior p(9), so that 
(6.11) will be a very inefficient estimator (most of the terms in the sum will 
be near 0). A better approach would be to use samples from the posterior 
distribution, as suggested by Newton and Raftery (1994). These authors 
develop the estimator 


p( y) = 


IV_-_- 

G Uf{ y\o (g) ) 


( 6 . 12 ) 
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the harmonic mean of the posterior sample conditional likelihoods. This ap¬ 
proach, while efficient, can be quite unstable since a few of the conditional 
likelihood terms in the sum will still be near 0. Theoretically, this difficulty 
corresponds to this estimator's failure to obey a Gaussian central limit the¬ 
orem as G —* oo. To correct this, Newton and Raftery suggest a compromise 
between methods (6.11) and (6.12) wherein we use a mixture of the prior 
and posterior densities for each model, p(6) = 6p(9) -f- (1 — <5)p(0|y) as an 
importance sampling density. Defining w(9) = p(9)/p(9). we then have 

, Ef=i /(y I ei9) ) w{o (9) ) 

p(y) = ~ ~~ ^ g—-' 

22g=l w{9 (9> ) 

This estimator is more stable and does satisfy a central limit theorem, 
although in this form it does require sampling from both the posterior and 
the prior. 

A useful generalization of (6.12) was provided by Gelfand and Dey (1994), 
who began with the identity 

Wy)V ^ S 7mrn my)m ' 


which holds for any proper density h. Given samples 9 i9> from the posterior 
distribution, this suggests the estimator 


1 h{0 {g) ) 

G f{y\o {9) ) p(0 {9) ) 


(6.13) 


Notice that taking h(9) = p(0), the prior density, produces (6.12). How¬ 
ever, this is a poor choice from the standpoint of importance sampling 
theory, which instead suggests choosing h to roughly match the posterior 
density. Gelfand and Dey suggest a multivariate normal or t density with 
mean and covariance matrix estimated from the samples. Kass and 
Raftery (1995) observe that this estimator does satisfy a central limit the¬ 
orem provided that 


f(y\e)p(0) 


d8 < oo , 


i.e., the tails of h are relatively thin. The estimator does seem to perform 
well in practice unless 9 is of too high a dimension, complicating the selec¬ 
tion of a good h density. 


6.3.2 Using Gibbs sampler output 

In the case of models fit via Gibbs sampling with closed-form full condi¬ 
tional distributions. Chib (1995) offers an alternative that avoids the spec¬ 
ification of the h function above. The method begins by simply rewriting 
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Bayes' rule as 


p(y) = 


f{y\0)p(0) 


p(0 |y) 

Only the denominator on the right-hand side is unknown, so an estimate 
of the posterior would produce an estimate of the marginal density, as we 
desire. But since this identity holds for any 0 value, we require only a 
posterior density estimate at a single point - say. O' Thus we have 

log Ply' = log/(y O') -t-logp(0') - logp(0'|y) , (6.14) 


where we have switched to the log scale to improve computational accuracy. 
While in theory O' L arbitrary. Chib (1995) suggests choosing it as a point 
of high posterior density, again to maximize accuracy in (6.14). 

It remains to show how to obtain the estimate p(0'|y)- We first describe 
the technique in the case where the parameter vector can be decomposed 
into two blocks (similar to the data augmentation scenario of Tanner and 
Wong. 1987). That is, we suppose that 0 — (#1,02), where p( 0 i|y, 02 ) and 
p (0 2 |y,#i) are both available in closed form. Writing 

p(0‘ i, 0 2 |y) = p( e 2 |y> 0 'i)p(°[\y) , (6.15) 

we observe that the first term on the right-hand side is available explicitly at 
O', while the second can be estimated via the "Rao-Blackwellized" mixture 
estimate (5.16), namely. 


p(°' lly) = ^Ylp( d ' lly- 6 '^). 

5=1 

since 0 ( / j ~p{0 2 \y), g = 1, . . ., G. Thus our marginal density estimate in 
(6.14) becomes 


logp(y) = log/(y|0i,0' 2 ) + log p(0[,0 2 ) -logp^ly,#!) - logp(0j|y) . 

Exponentiating produces the final marginal density estimate. 

Next suppose there are three parameter blocks, 9 — (0 \, 0 2 , 83 ). The 
decomposition of the joint posterior density (6.15) now becomes 

p(0' v 0' 2 .0' 3 \y) = p(0' 3 \y.0 , 1 ,0' 2 )p(0' 2 \y,0' 1 )p(0' 1 \y) . 


Again the first term is available explicitly, while the third term may be 
estimated as a mixture of />(|y, Olf 1 . 6 l 3 31 ), g = 1, . . G, similar to (6.16) 
above. For the second term, we may write 

P(0 2 |y,0l) = j p{S l 2 \y, 8 \, 03 )p(e 3 \y, 8 ' l )d 03 , 

suggesting the estimator 


p(02ly>0'i) = ^Ylpi 6 '-! |y.0i,03 (i7) ) 


5=1 
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where 6 3^ ~p(^3|y,^i)- Such draws are not available from the original 
posterior sample, which instead contains 0g ~ p{ 61.31y). However, we may 
produce them simply by continuing the Gibbs sampler for an additional G 
iterations with only two full conditional distributions, namely 

P{& 2|y,0i,03) andp(0 3 |y,0 , 1 ,0 2 ) . 

Thus, while additional sampling is required, new computer code is not; 
we need only continue with a portion of the old code. The final marginal 
density estimate then arises from 

logp(y) = log/(y|0i,02,0' 3 ) + logp(0i,0' 2 ,03) 

— logp(0' 3 |y. 8 \. 0' 2 ) -logp(0' 2 |y,0i) -logp(0i|y) . 

The extension to B parameter blocks requires a similar factoring of the 
joint posterior into B components, with (B - 1 ) Gibbs sampling runs of 
G samples each to estimate the various factors. We note that clever par¬ 
titioning of the parameter vector into only a few blocks (each still having 
a closed form full conditional) can increase computational accuracy and 
reduce programming and sampling time as well. 


6.3.3 Using Metropolis-Hastings output 


Of course, equations like (6.16) require us to know the normalizing con¬ 
stant for the full conditional distribution p(d\ |y, 0 2 ), thus precluding their 
use with full conditionals updated using Metropolis-Hastings (rather than 
Gibbs) steps. To remedy this, Chib and Jeliazkov (1999) extend the ap¬ 
proach, which takes a particularly simple form in the case where the pa¬ 
rameter vector 6 can be updated in a single block. Let 


«(0,0*|y) 


min 


p(g*|y)g(fl*,fl|yn 

p(0\y)q(6,e*\y) J 


the probability of accepting a Metropolis-Hastings candidate 8 * generated 
from a candidate density q(8. 8* |y) (note that this density is allowed to 
depend on the data y). Chib and Jeliazkov (1999) then show 


pi$ ly) = -twowT 


(6.17) 


where E\ is the expectation with respect to the posterior p(0|y) and £2 is 
the expectation with respect to the candidate density q( 8 '. 0[y). The nu¬ 
merator is then estimated by averaging the product in braces with respect 
to draws from the posterior, while the denominator is estimated by averag¬ 
ing the acceptance probability with respect to draws from q{8',8 |y), given 
the fixed value 8 '. Note that this calculation does not require knowledge of 
the normalizing constant for p(0|y). Plugging this estimate of (6.17) into 
(6.14) completes the estimation of the marginal likelihood. When there are 


© 2000 by CRC Press LLC 



two more more blocks, Chib and Jeliazkov (1999) illustrate an extended 
version of this algorithm using multiple MCMC runs, similar to the Chib 
(1995) approach for the Gibbs sampler outlined in the previous subsection. 


6.4 Bayes factors via sampling over the model space 


The methods of the previous section all seek to estimate the marginal 
density p(y) for each model, and subsequently calculate the Bayes factor 
via equation (6.10). They also operate on a posterior sample that has al¬ 
ready been produced by some noniterative or Markov chain Monte Carlo 
method, though the methods of Chib (1995) and Chib and Jeliazkov (1999) 
will often require multiple runs of slightly- different versions of the MCMC 
algorithm to produce the necessary- output. But for some complicated or 
high-dimensional model settings, such as spatial models using Markov ran¬ 
dom field priors (which involve large numbers of random effect parameters 
that cannot be analytically integrated out of the likelihood nor readily 
updated in blocks), these methods may be infeasible. 

An alternative approach favored by many- authors is to include the model 
indicator M as a parameter in the sampling algorithm itself. This of course 
complicates the initial sampling process, but has the important benefit 
of producing a stream of samples {.U' 3 '}^ =1 from p(M\y), the marginal 
posterior distribution of the model indicator. Hence the ratio 


P(M = j|y) = 


number of M' 91 = j 

-, j = 1 

total number of A/G) 


(6.18) 


provides a simple estimate of each posterior model probability, which may 
then be used to compute the Bayes factor between any two of the models, 
say j and j via 


BFi 


P(M = j\y)/p{M =/|y) 
P(M = j)/p(M = j>) 


(6.19) 


the original formula used to define the Bayes factor in (2.19). Estimated 
variances of the estimates in (6.18) are easy to obtain even if the M ia > 
output stream exhibits autocorrelation through the hatching formula (5.39) 
or other methods mentioned in Subsection 5.4.6. 


Sampling over the model space alone 

Most of the methods we will discuss involve algorithmic searches over both 
the model and the parameter spaces simultaneously. This is of course a nat¬ 
ural way to think about the problem, but like any other augmented sampler 
such an approach risks a less well-identified parameter space, increased cor¬ 
relations, and hence slower convergence than a sampler operating on any 
one of the models alone. Moreover, if our interest truly lies only in com¬ 
puting posterior model probabilities p(M = /|y) or a Bayes factor, the 
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parameter samples are not needed, relegating the 9j to nuisance parame¬ 
ter status. 

These thoughts motivate the creation of samplers that operate on the 
model space alone. This in turn requires us to integrate the Oj out of the 
model before sampling begins. To obtain such marginalized expressions in 
closed form requires fairly specialized likelihood and prior settings, but sev¬ 
eral authors have made headway in this area for surprisingly broad classes 
of models. For example, Madigan and York (1995) offer an algorithm for 
searching over a space of graphical models for discrete data, an approach 
they refer to as Markov chain Monte Carlo model composition, or (MC) 3 . 
Raftery, Madigan, and Floeting (1997) work instead in the multiple re¬ 
gression setting with conjugate priors, again enabling a model-space-only 
search. They compare the ( MC) 3 approach with the "Occam's Window" 
method of Madigan and Raftery (1994). Finally, Clyde, DeSimone, and 
Parmigiani (1996) use importance sampling (not MCMC) to search for the 
most promising models in a hierarchical normal linear model setting. They 
employ an orthogonalization of the design matrix which enables impressive 
gains in efficiency over the Metropolis-based (MC) 3 method. 


Sampling over model and parameter space 


Unfortunately, most model settings are too complicated to allow the entire 
parameter vector 0 to be integrated out of the joint posterior in closed 
form, and thus require that any MCMC model search be over the model 
and parameter space jointly. Such searches date at least to the work of 
Carlin and Poison (1991), who included M as a parameter in the Gibbs 
sampler in concert with the scale mixture of normals idea mentioned in 
Subsection 6.1.1. In this way they computed Bayes factors and compared 
marginal posterior densities for a parameter of interest under changing 
specification of the model error densities and related prior densities. Their 
algorithm required that the models share the same parametrization. how¬ 
ever. and so would not be appropriate for comparing two different mean 
structures (say, a linear and a quadratic). George and McCulloch (1993) 
circumvented this problem for multiple regression models by introducing 
latent indicator variables which determine whether or not a particular re¬ 
gression coefficient may be safely estimated by 0. a process they referred to 
as stochastic search variable selection (SSVS). Unfortunately, in order to 
satisfy the Markov convergence requirement of the Gibbs sampler, a regres¬ 
sor can never completely "disappear" from the model, so the Bayes factors 
obtained necessarily depend on values of user-chosen tuning constants. 
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6.4.1 Product space search 


Carlin and Chib (1995), whom we abbreviate as "CC", present a Gibbs 
sampling method that avoids these theoretical convergence difficulties and 
accommodates completely general model settings. Suppose there are K 
candidate models, and corresponding to each, there is a distinct parameter 
vector 9j having dimension n 3 , j = 1, . . . , K. Our interest lies in p(M = 
j I y ). j = 1, • • K the posterior probabilities of each of the K models, and 
possibly the model-specific posterior distributions p( Oj\M = j , y) as well. 

Supposing that Gj £ the approach is essentially to sample over 

the model indicator and the Product space HjLi This entails viewing 
the prior distributions as model specific and part of the Bayesian model 
specification. That is, corresponding to model j we write the likelihood as 
f{y\6j. M = j) and the prior as p(Gj | M = j). Since we are assuming that 
M merely provides an indicator as to which particular Gj is relevant to y, we 
have that y is independent of {0^ } given that M = j. In addition, since 
our primary goal is the computation of Bayes factors, we assume that each 
prior p(6j\M = j) is proper (though possibly quite vague). For simplicity 
we assume complete independence among the various Gj given the model 
indicator M, and thus may complete the Bayesian model specification by 
choosing proper "pseudopriors," p(Gj | M ^ j). Writing G — {G i .,8k}, 
our conditional independence assumptions imply that 

p(y\M = j) = J f(y\G,M = j)p(8\M = j)dG 

= Jf(y\8j,M = j)p(8j\M = j)dGj, 

and so the form given to p(Gj \M / j) is irrelevant. Thus, as the name 
suggests, a pseudoprior is not really a prior at all, but only a conveniently 
chosen linking density, used to completely define the joint model specifica¬ 
tion. Hence the joint distribution of y and Q given M =j is 


p(y,G,M = j) 


K 


i\M = j) 


n 3 > 


,i = l 


where nj = P(M = j) is the prior probability assigned to model j, satisfy¬ 
ing £f=i = i- 

In order to implement the Gibbs sampler, we need the full conditional 
distributions of each Gj and M. The former is given by 


p(Gj\Gi^j,M,y) <x 


f(y\9j, M = j)p(Gj\M = j) , M = j 
p(Gj | M ^ j) , M / j 


That is, when M — j we generate from the usual model j full conditional; 
when M ^ j we generate from the pseudoprior. Both of these generations 
are straightforward provided p(Gj |M = j) is taken to be conjugate with its 
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likelihood. The full conditional for M is 


P(M = j\6,y) = 


f(y\dj , M = j) {nf=i P{8i\M = j)} 7 Tj 
ZLi f(yl&k,M = k) = k) j 7 r* 


( 6 . 20 ) 


Since . 11 is a discrete finite parameter, its generation is routine as well. 
Hence all the required full conditional distributions are well defined and, 
under the usual regularity conditions (Roberts and Smith, 1993), the al¬ 
gorithm will produce samples from the correct joint posterior distribution. 
In particular, equations (6.18) and (6.19) can now be used to estimate the 
posterior model probabilities and the Bayes factor between any two models, 
respectively. 


Notes on implementation 

Notice that, in contrast to equation (6.18) summarization of the totality 
of 0\ ‘ n samples is not appropriate. This is because what is of interest in 
our setting is not the marginal posterior densities p{ 0, |y), but rather the 
conditional posterior densities p(6j \M = j, y). However, suppose that in 
addition to 6 we have a vector of nuisance parameters rp common to all 
models. Then the fully marginal posterior density of rj, p{ //|y), may be of 
some interest. Since the data are informative about rj regardless of the value 
of M, a pseudoprior for r) is not required. Still, care must be taken to insure 
that i] has the same interpretation in both models. For example, suppose 
we wish to choose between the two nested regression models 

M = 1 : yi = a + €i, €i Normal(0, <x 2 ) , 

and 

M— 2: yi - a + fdxi + €i, e,- Normal(0, r 2 ) , 

both for i = 1 In the above notation, 6i = cr, 6 2 = (P-t). and 

t] = a. Notice that a is playing the role of a grand mean in model 1. but is 

merely an intercept in model 2. The corresponding posteriors could be quite 
different if, for example, the observed ip values were centered near 0. while 
those for Xi were centered far from 0. Besides being unmeaningful from 
a practical point of view, the resulting p(ajy) would likely be bimodal, 
a shape that could wreak great havoc with convergence of the MCMC 
algorithm, since jumps between M — 1 and 2 would be extremely unlikely. 

Poor choices of the pseudopriors p(0j\M ^ j) can have a similar dele¬ 
terious effect on convergence. Good choices will produce 0\ 3 values that 
are consistent with the data, so that p(M = j\6. y) will still be reasonably 
large at the next M update step. Failure to generate competitive pseu¬ 
doprior values will again result in intolerably high autocorrelations in the 
chain, hence slow convergence. As such, matching the pseudopriors 
as nearly as possible to the true model-specific posteriors is recommended. 
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This can be done using first-order (normal) approximations or other para¬ 
metric forms designed to mimic the output from K individual preliminary 
MCMC runs. Note that we are not using the data to help select the prior, 
but only the pseudoprior. 

If for a particular dataset one of the p(M = y'ly) is extremely large, 
the realized chain will exhibit slow convergence due to the resulting near¬ 
absorbing state in the algorithm. In this case, the Tty may be adjusted to 
correct the imbalance; the final Bayes factor computed from (6.10) will still 
reflect the true odds in favor of M = j suggested by the data. 

Finally, one might plausibly consider the modified version of the above 
algorithm that skips the generation of actual pseudoprior values, and simply 
keeps 0 { . g) at its current value when M<» ^ j. While this MCMC algorithm 
is no longer a Gibbs sampler. Green and O'Hagan (1998) show that it does 
still converge to the correct stationary distribution (i.e.. its transition kernel 
does satisfy detailed balance). Under this alternative we would be free to 
choose p(8j\M — i) — p{ 8j j M — j ) for all i, meaning that all of these 
terms cancel from the M update step in equation (6.20). This pseudoprior- 
free version of the algorithm thus avoids the need for preliminary runs, 
and greatly reduces the generation and storage burdens in the final model 
choice run. Unfortunately, Green and O'Hagan also find it to perform quite 
poorly, since without the variability in candidate 8j 3> values created by the 
pseudopriors, switches between models are rare. 


6.4.2 "Metropolized"product space search 

Dellaportas, Forster, and Ntzoufras (1998) propose a hybrid Gibbs-Metrop- 
olis strategy. In their strategy, the model selection step is based on a pro¬ 
posal for a move from model j to j', followed by acceptance or rejection 
of this proposal. That is, the method is a "Metropolized Carlin and Chib" 
(MCC) approach, which proceeds as follows; 

1. Let the current state be ( j,8j ), where 8j is of dimension rij. 

2. Propose a new model j' with probability h(j,j'). 

3. Generate 8p from a pseudoprior p(8j<\M ^ /) as in Carlin and Chib's 
product space search method. 

4. Accept the proposed move (from j to j') with probability 

• L f{y\6j‘,M =) r )p{6j:\M =)’)p(8 3 \M = j) \ 

aj ^' mm 1 ’ j)p(0j\M = j)p(8p\M j ■ 

Thus by "Metropolizing" the model selection step, the MCC method 
needs to sample only from the pseudoprior for the proposed model j'. In 
this method, the move is a Gibbs step or a sequence of Gibbs steps when 
j 1 ~ j. Posterior model probabilities and Bayes factors can be estimated 
as before. 
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6.4.3 Reversible jump MCMC 


This method, originally due to Green (1995), is another strategy that sam¬ 
ples over the model and parameter space, but which avoids the full product 
space search of the Carlin and Chib (1995) method (and the associated 
pseudoprior specification and sampling), at the cost of a less straightfor¬ 
ward algorithm operating on the union space, .Ad x (J/gA-t ®; • ^ generates 
a Markov chain that can "jump" between models with parameter spaces 
of different dimensions, while retaining the aperiodicity, irreducibility, and 
detailed balance conditions necessary for MCMC convergence. 

A typical reversible jump algorithm proceeds as follows. 

1. Let the current state of the Markov chain be (j, Oj ). where 9j is of 
dimension 

2. Propose a new model j' with probability h(j,jj. 

3. Generate u from a proposal density q(u\9-, ,],]')■ 

4. Set (Op, u 7 ) = gj i u ), where g h p is a deterministic function that is 
1-1 and onto. This is a "dimension matching" function, specified so that 
rij + dim(u)=nji + dim{\i'). 

5. Accept the proposed move (from j to j j with probability a, <, which 
is the minimum of 1 and 

/(y|fl',,M = j l )p(9' ] ,\M = i')y ] 'h(j',j)q(xi l \9p,j l ,j) dg(9j,u) 
f(y\0],M =j)p(9j\M = j)^jh{jJ>)q(u\9j,jJ') 0(8,, u) 


When j’ = j, the move can be either a standard Metropolis-Hastings or 
Gibbs step. Posterior model probabilities and Bayes factors may be esti¬ 
mated from the output of this algorithm as in the previous subsection. 

The "dimension matching" aspect of this algorithm (step 4 above) is a 
bit obscure and merits further discussion. Suppose we are comparing two 
models, for which 0 \ £ 5ft and 6 2 £ -ft 2 , if 0 } is a subvector of 62 , then when 
moving from j = 1 toy' = 2, we might simply draw u ~q(u) and set 


9 2 = (At ! u ) ■ 

That is, the dimension matching g is the identity function, and so the 
Jacobian in step 5 is equal to 1. Thus if we set h{ 1,1) = h( 1, 2) = h{ 2. 1) = 
h{ 2,2) = 1/2, we have 


«i-^2 — min 


f(y\9' 2 ,M =2)p(9' 2 \M =2 )x, \ 

f(y\9 u M = l)p(9i\M = l)xiq(u)) 


with a corresponding expression for ct 2 — 1 ■ 

In many cases, however, 9i will not naturally be thought of as a subvector 
of 02 . Green (1995) considers the case of a changepoint model (see e.g. 
Table 5.7 and the associated exercises in Chapter 5). in which the choice is 
between a time series model having a single, constant mean level 9 \, and 
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one having two levels - say, @ 2,1 before the changepoint and 9 2,2 afterward. 
In this setting, when moving from Model 2 to 1, we would not likely want 
to use either # 2 ,1 or B-> 2 as the proposal value 0[. A more plausible choice 
might be 

e[ = 9 ^i±hi , ( 6 . 21 ) 

since the average of the pre- and post-changepoint levels should provide a 
competitive value for the single level in Model 1. To ensure reversibility of 
this move, when going from Model 1 to 2 we might sample u ~ q(u) and 
set 

<? 2 ,i — 61 — u and 0' 22 — 0 i + m , 

since this is a 1 -1 and onto function corresponding to the deterministic 
down move (6.21). 

Several variations or simplifications of reversible jump MCMC have been 
proposed for various model classes; see e.g. Richardson and Green (1997) 
in the context of mixture modeling, and Knorr-Held and Rasser (2000) for 
a spatial disease mapping application. Also, the "jump diffusion" approach 
of Phillips and Smith (1996) can be thought of as a variant on the reversible 
jump idea. 

As with other Metropolis-Hastings algorithms, transformations to var¬ 
ious parameters are often helpful in specifying proposal densities in re¬ 
versible jump algorithms (say, taking the log of a variance parameter). It 
may also be helpful to apply reversible jump to a somewhat reduced model, 
where we analytically integrate certain parameters out of the model and use 
(lower-dimensional) proposal densities for the parameters that remain. We 
will generally not have closed forms for the full conditional distributions of 
these "leftover" parameters, but this is not an issue since, unlike the Gibbs- 
based CC method, reversible jump does not require them. Here an example 
might be hierarchical normal random effects models with conjugate priors: 
the random effects (and perhaps even the fixed effects) may be integrated 
out, permitting the algorithm to sample only the model indicator and the 
few remaining (variance) parameters. 

6.4.4 Using partial analytic structure 

Godsill (1997) proposes use of a "composite model space," which is es¬ 
sentially the setting of Carlin and Chib (1995) except that parameters are 
allowed to be "shared" between different models. A standard Gibbs sampler 
applied to this composite model produces the CC method, while a more 
sophisticated Metropolis-Hastings approach produces a version of the re¬ 
versible jump algorithm that avoids the "dimension matching" step present 
in its original formulation (see step 4 in Subsection 6.4.3 above). This step is 
often helpful for challenging problems (e.g., when moving to a model con- 
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taining many parameters whose values would not plausibly equal any of 
those in the current model), but may be unnecessary for simpler problems. 

Along these lines, Godsill (1997) outlines a reversible jump method that 
takes advantage of partial analytic structure (PAS) in the Bayesian model. 
This procedure is applicable when there exists a subvector (9ji)u of the 
parameter vector 9n for model j ' such that p((9ji)u\(9ji)-u ,M =j',y) 
is available in closed form, and in the current model j, there exists an 
equivalent subvector ( 0, )-u (the elements of 9j not in subvector U of the 
same dimension as Operationally, 

1. Let the current state be (j, 9j), where 6j is of dimension 

2. Propose a new model /' with probability h(J, j r ). 

3. Set = (9j)_ u , 

4. Accept the proposed move with probability 


(Yj < = min 


p(j , \{0j')-u-y)h(j' .j) } 
p(j\(^j)-u,y)h(j,j') J ’ 


( 6 . 22 ) 


where p(j\(9j)_ u ,y) = Jp(j, (9 j ) u \(9 j ). u , y) d{9j) u . 

5. If the modeLmove is accepted update the parameters of the new model 
(9j')u and (Oji) -u using standard Gibbs or Metropolis-Hastings steps; 
otherwise, update the parameters of the old model (9j)u and (9j)_u 
using standard Gibbs or Metropolis-Hastings steps. 


Note that model move proposals of the form j —. j always have accep¬ 
tance probability 1, and therefore when the current model is proposed, this 
algorithm simplifies to standard Gibbs or Metropolis-Hastings steps. Note 
that multiple proposal densities may be needed for (9 j)u across models 
since, while this parameter is common to all of them, its interpretation and 
posterior support may difFer. Troughton and Godsill (1997) give an exam¬ 
ple of this algorithm, in which the update step of (9j)u is skipped when a 
proposed model move is rejected. 


Summary and recommendations 

Han and Carlin (2000) review and compare many of the methods described 
in this and the previous subsections in the context of two examples, the first 
a simple regression example, and the second a more challenging hierarchical 
longitudinal model (see Section 7.4). The methods described in this section 
that sample jointly over model and parameter space (such as product space 
search and reversible jump) often converge very slowly, due to the difficulty 
in finding suitable pseudoprior (or proposal) densities. Marginalizing ran¬ 
dom effects out of the model can be helpful in this regard, but this tends to 
create new problems: the marginalization often leads to a very complicated 
form for the model switch acceptance ratio (step 5 in Subsection 6.4.3), thus 
increasing the chance of an algebraic or computational error. Even with the 
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marginalization, such methods remain difficult to tune. The user will often 
need a rough idea of the posterior model probabilities p(M = /ly) in or¬ 
der to set the prior model probabilities 7 r ; in such a way that the sampler 
spends roughly equal time visiting each of the candidate models. Prelimi¬ 
nary model-specific runs are also typically required to specify proposal (or 
pseudoprior) densities for each model. 

By contrast, the marginal likelihood methods of Section 6.3 appear rela¬ 
tively easy to program and tune. These methods do not require preliminary 
runs (only a point of high posterior density, 9'), and in the case of the Gibbs 
sampler, only a rearrangement of existing computer code. Estimating stan¬ 
dard errors is more problematic (the authors' suggested approach involves a 
spectral density estimate and the delta method), but simply replicating the 
entire procedure a few times with different random number seeds generally 
provides an acceptable idea of the procedure's order of accuracy. 

In their numerical illustrations, Han and Carlin (2000) found that the 
RJ and PAS methods ran more quickly than the other model space search 
methods, but the marginal likelihood methods seemed to produce the high¬ 
est degree of accuracy for roughly comparable runtimes. This is in keeping 
with the intuition that some gain in precision should accrue to MCMC 
methods that avoid a model space search. 

As such, we recommend the marginal likelihood methods as relatively 
easy and safe approaches when choosing among a collection of standard 
(e.g., hierarchical linear) models. We hasten to add however that the block¬ 
ing required by these methods may preclude their use in some settings, 
such as spatial models using Markov random field priors (which involve 
large numbers of random effect parameters that cannot be analytically 
integrated out of the likelihood nor readily updated in blocks; see Sub¬ 
section 7.8.2). In such cases, reversible jump may offer the only feasible 
alternative for estimating a Bayes factor. The marginal likelihood methods 
would also seem impractical if the number of candidate models were very 
large (e.g., in variable selection problems having 2 P possible models, corre¬ 
sponding to each of p predictors being either included or excluded). But as 
alluded to earlier, we caution that the ability of joint model and parameter 
space search methods to sample effectively over such large spaces is very 
much in doubt; see for example Clyde et al. (1996). 

6.5 Other model selection methods 

The Bayes factor estimation methods discussed in the previous two sections 
require substantial time and effort (both human and computer) for a rather 
modest payoff, namely, a collection of posterior model probability estimates 
(possibly augmented with associated standard error estimates). Besides be¬ 
ing mere single number summaries of relative model worth, Bayes factors 
are not interpretable with improper priors on any components of the pa- 
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rameter vector, since if the prior distribution is improper then the marginal 
distribution of the data necessarily is as well. Even for proper priors, the 
Bayes factor has been criticized on theoretical grounds; see for example 
Gelfand and Dey (1994) and Draper (1995). One might conclude that none 
of the methods considered thus far is appropriate for everyday, "rough and 
ready" model comparison, and instead search for more computationally 
realistic alternatives. 

One such alternative might be more informal, perhaps graphical meth¬ 
ods for model selection. These could be based on the marginal distri¬ 
butions m(yi), / = 1,...,» (as in Berger 1985. p. 199). provided they 
exist. Alternatively, we could use the conditional predictive distributions 
f(Vi\y(i)), i= i,..., n (as in Gelfand et al., 1992), since they will be proper 
when p(0|y(j)) is, and they indicate the likelihood of each datapoint in the 
presence of all the rest. For example, the product (or the sum of the logs) 
of the observed conditional predictive ordinate (CPO) values f(yf bs y,, ( ) 
given in (2.25) could be compared across models, with the larger result in¬ 
dicating the preferred model. Alternatively, sums of the squares or absolute 
values of the standardized residuals d, : given in (2.24) could be compared 
across models, with the model having the smaller value now being pre¬ 
ferred. We could improve this procedure by accounting for differing model 
size - say, by plotting the numerator of (2.24) versus its denominator for a 
variety of candidate models on the same set of axes. In this way we could 
judge the accuracy of each model relative to its precision. For example, an 
overfitted model (i.e., one with redundant predictor variables) would tend 
to have residuals of roughly the same size as an adequate one. but with 
higher variances (due to the "collinearity" of the predictors). We exemplify 
this conditional predictive approach in our Section 8.1 case study. 


6.5.1 Penalized likelihood criteria 


For some advanced models, even cross-validatory predictive selection meth¬ 
ods may be unavailable. For example, in our Section 8.3 case study, the 
presence of certain model parameters identified only by the prior leads to 
an information deficit that causes the conditional predictive distributions 
(2.25) to be improper. In many cases, informal likelihood or penalized like¬ 
lihood criteria may offer a feasible alternative. Loglikelihood summaries are 

easy to estimate using posterior samples { ^, g = 1.G}. since we may 

think of t = log 1 ,(8) as a parametric function of interest, and subsequently 
compute 


^F[logF(0)|y]«i^logi(0'^) 


0=1 

as an overall measure of model fit to be compared across models. To ac¬ 
count for differing model size, we could penalize ( using the same sort 
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of penalties as the Bayesian information (Schwarz) criterion (2.21) or the 
Akaike information criterion (2.22). In the case of the former, for example, 
we would have 

BIC — 2 1 — plog n , 

where as usual p is the number of parameters in the model, and n is the 
number of datapoints. 

Unfortunately, a problem arises here for the case of hierarchical mod¬ 
els: what exactly are p and n? For example, in a longitudinal setting (see 
Section 7.4) in which we have s,- observations on patient i, i = 1, ... , m, 
shall we set n — • S{ (the total number of observations), or n — m (the 

number of patients)? If the observations on every patient were indepen¬ 
dent, the former choice would seem most appropriate, while if they were 
perfectly correlated within each patient, we might instead choose the lat¬ 
ter. But of course the true state of nature is likely somewhere in between. 
Similarly, if we had a collection of rn random effects, one for each patient, 
what does this contribute to pi If the random effects had nothing in com¬ 
mon (i.e., they were essentially like fixed effects), they would contribute the 
full m parameters to p, but if the data (or prior) indicated they were all 
essentially identical, they would contribute little more than one "effective 
parameter" to the total model size p. Pauler (1998) obtains results in the 
case of hierarchical normal linear models, but results for general models 
remain elusive. 

In part to help address this problem, Spiegelhalter et al. (1998) suggest 
a generalization of the Akaike information criterion (AIC) that is based on 
the posterior distribution of the deviance statistic, 

D(6) = -21og/(y|0) + 21og/.(y) , (6.23) 

where /(y|0) is the likelihood function for the observed data vector y given 
the parameter vector 0, and h{ y) is some standardizing function of the data 
alone (which thus has no impact on model selection). In this approach, the 
fit of a model is summarized by the posterior expectation of the deviance, 
D = Eg\y[D], while the complexity of a model is captured by the effective 
number of parameters p D (note that this is often less than the total number 
of model parameters due to the borrowing of strength across individual- 
level parameters in hierarchical models). Below we show that a reasonable 
definition of po is the expected deviance minus the deviance evaluated at 
the posterior expectations, 

p D = E ely [D\ - D(E e |J0]) = D- D(9) . (6.24) 

The Deviance Information Criterion (DIC) is then defined as 

DIC = D + Pd = 2D- D(9) , (6.25) 

with smaller values of DIC indicating a better-fitting model. As with other 
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penalized likelihood criteria, we caution that DIC is not intended for identi¬ 
fication of the "correct" model, but rather merely as a method of comparing 
a collection of alternative formulations (all of which may be incorrect). 

An asymptotic justification of DIC is straightforward in cases where the 
number of observations n grows with respect to the number of parameters 
p, and where the prior p(9) is non-hierarchical and completely specified 
(i e., having no unknown parameters). Here we may expand D(9) around 
0 to give, to second order, 

D{6) x, D(8) - 2(0 - 6) t L' - (0 - 0) T I"(0 - 0) (6.26) 

where L = logp(y|0) =- D(6)/ 2, and L' and L" are the first deriva¬ 
tive vector and second derivative matrix with respect to 0. But from the 
Bayesian Central Limit Theorem (Theorem 5.1) we have that 0 ly is ap¬ 
proximately distributed as N (s, [-L"] _1 j, where 0 = 0 are the max¬ 
imum likelihood estimates such that L' = 0. This in turn implies that 
(0 — 0) T (— L")(6 — 0) has an approximate chi-squared distribution with p 
degrees of freedom. Thus writing /Lion (0) to represent the deviance for a 
non-hierarchical model, from (6.26) we have that 

Aion(0) * D(9) — (0 — 9) T L"(9 — 0) = D{9) + . 

Rearranging this expression and taking expectations with respect to the 
posterior distribution of 0, we have 

P « E ely [D n0 n(8)] - D(9) , (6.27) 

so that the number of parameters is approximately the expected deviance 
D = Eg\ y [Dn 0 n(9 )] minus the fitted deviance. But since A1C = D(0) + '2p. 
from (6.27) we obtain AIC % D+p, the expected deviance plus the number 
of parameters. The DIC approach for hierarchical models thus follows this 
equation and equation (6.27), but substituting the posterior mean 0 for the 
maximum likelihood estimate §. It is a generalisation of Akaike's criterion, 
since for non-hierarchical models 0 « 0, p D « p, and DIC % AIC. DIC 
can also be shown to have much in common with the hierarchical model 
selection tools previously suggested by Ye (1998) and Hodges and Sargent 
(1998), though the DIC idea applies much more generally. 

As with all penalized likelihood criteria, DIC consists of two terms, one 
representing "goodness of fit" and the other a penalty for increasing model 
complexity. DIC is also scale-free; the choice of standardizing function /i(y) 
in (6.23) is arbitrary. Thus values of DIC have no intrinsic meaning; as with 
AIC and BIC, only differences in DIC across models are meaningful. (Of 
course, po does have a scale, namely, the size of the effective parameter 
space.) Besides its generality, another attractive aspect of DIC is that it 
may be readily calculated during an MCMC run by monitoring both 0 
and D(9), and at the end of the run simply taking the sample mean of the 
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simulated values of D, minus the plug-in estimate of the deviance using the 
sample means of the simulated values of G. This quantity can be calculated 
for each model being considered without analytic adaptation, complicated 
loss functions, additional MCMC sampling (say, of predictive values), or 
any matrix inversion. 

While DIC has already been profitably applied in practice (see e.g. 
Erkanli et al., 1999), as with any new data analysis technique, several 
practical questions must be resolved before DIC can become a standard 
element of the applied Bayesian toolkit. For example, DIC is not invari¬ 
ant to parametrization, so (as with prior elicitation) the most plausible 
parametrization must be carefully chosen beforehand. Unknown scale pa¬ 
rameters and other innocuous restructuring of the model can also lead to 
small changes in the computed DIC value. Finally, determining an appro¬ 
priate variance estimate for DIC remains a vexing practical problem. Zhu 
and Carlin (2000) experiment with various delta method approaches to this 
problem in the context of spatio-temporal models of the sort considered in 
Section 7.8.2, but conclude that a "brute force" replication approach may 
be the only suitably accurate method in such complicated settings. That is, 
we would independently replicate the calculation of DIC a large number 
of times N, obtaining a sequence of DIC estimates { DIC \> l = I ,...,N }, 
and estimate VailDIC) by the sample variance, 

_ i v __ 

Var(DIC) = ]T(Z>/C) - DIC) 2 . 

1 i=i 


6.5.2 Predictive model selection 

Besides being rather ad hoc, a problem with penalized likelihood approaches 
is that the usual choices of penalty- function are motivated by asymptotic 
arguments. But for many of the complex and high-dimensional models we 
currently envision, the dimension of the parameter space p increases with 
the sample size n, so the usual asymptotic calculations are inappropriate. 
While DIC does offer a way to estimate the effective sizes of such non¬ 
regular models, we might instead work purely in predictive space, follow¬ 
ing Faud and Ibrahim (1995). Here we shall see that intuitively appealing 
penalties emerge without the need for asymptotics. As in Chapter 2, the 
basic distribution we work with is 

f(ynew\yobs) = j f{yneu,\Q)p{Q\yobs) dO , ( 6 . 28 ) 

where 6 denotes the collection of model parameters, and y new is viewed as 
a replicate of the observed data vector y obs . The model selection criterion 
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first selects a discrepancy function d{y new , y obs ), then computes 
E[d(y 

new > yobs)\y 0 b S ,Mi], (6.29) 

and selects the model which minimizes (6.29). For Gaussian likelihoods, 
Laud and Ibrahim (1995, p. 250) suggested 

d(y n cw i yobs') — (y new y 0 6 S ) (y new y 0 bs) • ( 6 . 30 ) 

For a non-Gaussian generalized linear mixed model, we may prefer to re¬ 
place (6.30) by the corresponding deviance criterion. For example, with a 
Poisson likelihood we would set 


d(y new) y 0 bs) = 2 ^{yi^bs log(j/i,o6 S /2// ,new ) ( Vl.obs Ul,new)} j (6.31) 

l 


where l indexes the components of y. Routine calculation shows that the 
I th term in the summation in (6.31) is strictly convex in y t new if y t obs > 0. 
To avoid problems with extreme values in the sample space, we replace 
(6.31) by 

d(y new i Yobs ) 

~ 2 Hi {(yj .Obs + 5 ) log ( y ,'1 ) _ (yi,obs - l ■ 
Suppose we write 

E[d{y ne w , yobs) \ y obs, Mi] — d(^E\y new \y obsj Mi\,y 0 bs) (6 32) 

+E[d{y new , y 0 bs ) | y 0 bs-■ Mi\ d(E[y new jy^^s, yobs) ■ 

Intuitive interpretations may be given to the terms in (6.32). The left- 
hand side is the expected predictive deviance (EPD) for model M-. The 
first term on the right-hand side is essentially the likelihood ratio statis¬ 
tic with the MLE for E(y new \ B t , Mi), which need not exist, replaced by 
E(y new |yo/'.s, ). Jensen s inequality shows that the second term minus 
the third is strictly positive, and is the penalty associated with M{. This 
difference becomes 

*2 Si (Vl,obs + 2 ) oo\ 

x {log E [y ltnew + ||yoi, s ] - E [log (y l)new + i) | y o4s ] } . 

Again, each term in the summation is positive. A second order Taylor series 
expansion shows that (6.33) is approximately 


£ 


Ul,obs + k 

{E(y, ,new + 5 lyois)] 2 


Fctr(y/ )fieu , 


y obs ) • 


(6.34) 


Hence (6.33) can be viewed as a weighted predictive variability penalty, a 
natural choice in that if Mj is too large (i.e., contains too many explanatory 
terms resulting in substantial multicollinearity), predictive variances will 
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increase. Lastly, (6.34) is approximately 


f 1 J (Vl } new ^{yi,new\yobs)}^ 1 

E(yi,new + \ | Yobs) y0t ' J 

and thus may be viewed as a predictive corrected goodness-of-fit statistic. 

Computation of (6.32) requires calculation of E(yi new |y 0 i, s , Mg) as well 
as i?[log(j// >net( ; + \)\y obs ,Mi}. Such predictive expectations are routinely 
obtained as Monte Carlo integrations; the case study in Section 8.3 provides 
an example. 

Finally, we remark that Gelfand and Ghosh (1998) extend the above 
approach to a completely formal decision-theoretic setting. Their method 
seeks to choose the model minimizing a particular posterior predictive loss. 
Like the less formal method above, the resulting criterion consists of a 
(possibly weighted) sum of a goodness-of-fit term and a model complexity 
penalty term. 


6.6 Exercises 

1. Consider the data displayed in Table 6.1, originally collected by Treloar 
(1974) and reproduced in Bates and Watts (1988). These data record 
the "velocity" y t of an enzymatic reaction (in counts/min/min) as a 
function of substrate concentration X( (in ppm), where the enzyme has 
been treated with Puromycin. 


Case (i) 


Pi 

Case (i) 

Xi 

y> 

1 

0.02 

76 

7 

0.22 

159 

2 

0.02 

47 

8 

0.22 

152 

3 

0.06 

97 

9 

0.56 

191 

4 

0.06 

107 

10 

0.56 

201 

5 

0.11 

123 

11 

1.10 

207 

6 

0.11 

139 

12 

1.10 

200 


Table 6.1 Puromycin experiment data 


A common model for analyzing biochemical kinetics data of this type is 
the Michaelis-Menten model, wherein we adopt the mean structure 

IM =7 + aXi/(0 + *,•) , 

where a, 7 £ and 6 € 3 )i ,+ . In the nomenclature of Example 6.2, we 
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have design matrix 


f?=( A_ ::: _L) 

\ Q+xi 0-\-x n J 

and (3 t = (7, a). For this model, obtain estimates of the marginal poste¬ 
rior density of the parameter a , and also the marginal posterior density 
of the mean velocity at X = 0.5, a substrate concentration not repre¬ 
sented in the original data, assuming that the error density of the data 
is 


(a) normal 

(b) Student's t with 2 degrees of freedom 

(c) double exponential. 


2. For the point null prior partitioning setting described in Subsection 6.1.2, 
show that requiring G EH C as given in (6.4) is equivalent to requiring 

BF < i where BF is the Bayes factor in favor of the null 

hypothesis. 

3. For the interval null prior partitioning setting of Subsection 6.1.2, derive 
an expression for the set of priors H c that correspond to rejecting Ho, 
similar to expression (6.4) for the point null case. 

4. Suppose we are estimating a set of residuals r t using a Monte Carlo 
approach, as described in Section 6.2. Due to a small sample size n, we 
are concerned that the approximation 

£(2/i|y(i)) = J E(yi\d)p(0\y {i) )d6 % J E{y l \8)p(d\y)d0 

will not be accurate. A faculty member suggests the importance sampling 
estimate 

E(yi\y(i)) = 


as an alternative. Is this a practical solution? What other approaches 
might we try? (Hint: See equation (5.8).) 

5. Suppose we have a convergent MCMC algorithm for drawing samples 
from p(#|y) oc /(y|#)7r(0). We wish to locate potential outliers by com¬ 
puting the conditional predictive ordinate fly, |y, ,,) given in equation 
(2.25) for each i = ,n. Give a computational formula we could use 

to obtain a simulation-consistent estimate of f(y, |y(,)). What criterion 
might we use to classify y { as a suspected outlier? 


^ P{0\y)d6 
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6 . In the previous problem, suppose we wish to evaluate the model using the 
model check p' D given in equation (2.28) with an independent validation 
data sample z. Give a computational formula we could use to obtain a 
simulation-consistent estimate of p' D . 

7. Consider again the data in Table 5.4. Define model 1 to be the three vari¬ 
able model given in Example 5.7, and model 2 to be the reduced model 
having rrij = 1 (i.e., the standard logistic regression model). Compute 
the Bayes factor in favor of model 1 using 

(a) the harmonic mean estimator, ( 6 . 12 ) 

(b) the importance sampling estimator, (6.13). 

Does the second perform better, as expected? 

8 . Consider the dataset of Williams (1959), displayed in Table 6.2. For 
n = 42 specimens of radiata pine, the maximum compressive strength 
parallel to the grain was measured, along with the specimen's density, 
Xi, and its density adjusted for resin content, z, (resin contributes much 
to density but little to strength of the wood). For i = 1,. . ., n we wish 
to compare the two models M = 1 and M = 2 where 

M — 1 : yi — a + /?(«< - x) + e,-, e,- Normal(0, a 2 ) , 

and 

M~ 2: yi = j + S(zi - z) + €i, a Normal(0,r 2 ) . 

We desire prior distributions that are roughly centered around the ap¬ 
propriate least squares parameter estimate, but are extremely vague 
(though still proper). As such, we place 

N ((3000, 185) t , Diag( 10 6 ,10 4 )) 

priors on (a,/3) T and (j,6) T and 

IG (3 , (2 • 300 2 )~ 1 ) 

priors on <r 2 and r~ having both mean and standard deviation equal to 
300 2 . 

Compute the Bayes factor in favor of model 2 (the adjusted density 
model) using 

(a) the marginal density estimation approach of Subsection 6.3.2 

(b) the product space search approach of Subsection 6.4.1 

(c) the reversible jump approach of Subsection 6.4.3. 

How do the methods compare in terms of accuracy? Ease of use? Which 
would you attempt first in a future problem? 
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Case (i) 

Vi 

Xi 

Zi 

Case (i) 

Vi 

Xi 

Zi 

1 

3040 

29.2 

25.4 

22 

3840 

30.7 

30.7 

2 

2470 

24.7 

22.2 

23 

3800 

32.7 

32.6 

3 

3610 

32.3 

32.2 

24 

4600 

32.6 

32.5 

4 

3480 

31.3 

31.0 

25 

1900 

22.1 

20.8 

5 

3810 

31.5 

30.9 

26 

2530 

25.3 

23.1 

6 

2330 

24.5 

23.9 

27 

2920 

30.8 

29.8 

7 

1800 

19.9 

19.2 

28 

4990 

38.9 

38.1 

8 

3110 

27.3 

27.2 

29 

1670 

22.1 

21.3 

9 

3160 

27.1 

26.3 

30 

3310 

29.2 

28.5 

10 

2310 

24.0 

23.9 

31 

3450 

30.1 

29.2 

11 

4360 

33.8 

33.2 

32 

3600 

31.4 

31.4 

12 

1880 

21.5 

21.0 

33 

2850 

26.7 

25.9 

13 

3670 

32.2 

29.0 

34 

1590 

22.1 

21.4 

14 

1740 

22.5 

22.0 

35 

3770 

30.3 

29.8 

15 

2250 

27.5 

23.8 

36 

3850 

32.0 

30.6 

16 

2650 

25.6 

25.3 

37 

2480 

23.2 

22.6 

17 

4970 

34.5 

34.2 

38 

3570 

30.3 

30.3 

18 

2620 

26.2 

25.7 

39 

2620 

29.9 

23.8 

19 

2900 

26.7 

26.4 

40 

1890 

20.8 

18.4 

20 

1670 

21.1 

20.0 

41 

3030 

33.2 

29.4 

21 

2540 

24.1 

23.9 

42 

3030 

28.2 

28.2 


Table 6.2 Radiata pine compressive strength data 
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CHAPTER 7 


Special methods and models 


Having learned the basics of specifying, fitting, evaluating, and compar¬ 
ing Bayes and EB models, we are now well equipped to tackle real-world 
data analysis. However, before doing this (in Chapter 8), we outline the 
Bayesian treatment of several special methods and models useful in prac¬ 
tice. We make no claim of comprehensive coverage; indeed, each of this 
chapter's sections could be expanded into a book of its own! Rather, our 
goal is to highlight the key differences between the Bayes/EB and classical 
approaches in these important areas, and provide links to more complete 
investigations. As general Bayesian modeling references, we list the recent 
books by O'Hagan (1994) and Gelman, Carlin, Stern, and Rubin (1995). 

7.1 Estimating histograms and ranks 

Much of this book has concentrated on inferences for coordinate-specific 
parameters: an experimental unit, an individual, a clinic, a geographic re¬ 
gion, etc. We now consider situations where the collection of parameters 
(the ensemble) is of primary or equal interest. For example, policy-relevant 
environmental or public health assessments involve synthesis of informa¬ 
tion (e.g. disease rates, relative risks, operative death rates) from a set of 
related geographic regions. Most commonly, region-specific rates are esti¬ 
mated and used for a variety of assessments. However, estimates of the 
histogram or empirical distribution function (edf) of the underlying rates 
(for example, to evaluate the number of rates above a threshold) or their 
rank ordering (for example, to prioritize environmental assessments) can 
be of equal importance. Other settings in which histograms or ranks might 
be of interest include estimating clinic-specific treatment effects, hospital 
or physician specific surgical death rates, school-specific teaching effective¬ 
ness, or county-specific numbers of children in poverty. For consistency, 
throughout this section we will always refer to the units as "regions." 

Broadly put, our goal is to estimate the edf of the latent parameters that 
generated the observed data. One may also be interested in the individual 
parameters, but the initial goal is estimation of the edf (see Tukey, 1974; 
Louis, 1984; Laird and Louis, 1989; and Devine, Halloran, Louis, 1994). As 
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shown by Shen and Louis (1998), Conlon and Louis (1999), and references 
therein, while posterior means are the most obvious and effective estimates 
for region-specific parameters, their edf is underdispersed and never valid 
for estimating the edf (or histogram) of the parameters, while the edf of the 
coordinate-specific MLEs is overdispersed. Ranking the posterior means 
can also produce suboptimal estimates of the parameter ranks; effective 
estimates of the parameter ranks should target them directly. Fortunately, 
structuring via Bayesian decision theory guides development of effective 
estimates of histograms and ranks, as we now describe. 

7.1.1 Model and inferential goals 

Consider again the two-stage, hierarchical model with a continuous prior 
distribution: 

q a d 

1 ’ (7-D 

Yi\0i ~ fi(Yi | $i), 

for j — 1, . . ., k. Let g be the density function of G. Although other loss 
functions may be more appropriate, to structure estimates in a mathemat¬ 
ically convenient manner we use squared error loss (SEL) for the relevant 
estimation goal. Our goals are to estimate the individual 9i , their ranks, 
and the histogram associated with their edf; we now consider each in turn. 

Estimating the Individual Os 

Under squared error loss, the posterior means (PM), 

ef M = r h = E[0i I Y { ] . 

are optimal. These are the "traditional" parameter estimates. In our sub¬ 
sequent development, all estimates are posterior means of target features, 
but we reserve the "PM" label for estimates of the individual Os. 

Ranking 

Laird and Louis (1989) and Stern and Cressie (1999) study the estimation of 
ranks. The rank vector R = (Ri, ..., Rk) has components R, — rank(dg) = 
} where f ., is the indicator function. The smallest 0 has rank 
1. The posterior mean rank, . is optimal under squared error loss: 

k 

Ri = E[Ri |Y] = ^P[^>^ | Y] ■ 

i=i 

See Laird and Louis (1989) for the posterior covariance of the ranks. 

Generally, the R, t are not integers. Though their mean, (k + l)/2, is the 
same as for integer ranks, their spread is smaller, and often substantially 
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smaller. For example, the largest Ri can be far smaller than k and the 
smallest can be far larger than 1. The non-integer feature of the Ri is 
attractive, because integer ranks can overrepresent distance and underrep¬ 
resent uncertainty, but we will need integer ranks. To obtain them, rank 
the Ri , producing 

Ri — rank (77,;) . 


Estimating, the EDF 


Let Gk denote the edf of the Os that underlie the current data set. Then 
G k (t I e) = 7 {0,<t} ; — oo < t < oo. With A(t) a candidate esti¬ 

mate of Gk(t), under weighted integrated squared error loss, J ir(/)[A( 7 ) — 
Gk(t)]~dt, the posterior mean is optimal: 

G k (t) = E[G k (t) I Y] - ^ P(0i < t | Yi). 

With the added constraint that A(t) is a discrete distribution with at most 
k mass points, Shen and Louis (1998) show that the optimal estimator G k 
has mass J- at 

k 

Ui = Gk 1 (^y j =l,...,k. ( 7 . 2 ) 

We now show that both the PMs and the Fs themselves are inappropriate 
as estimates of Gk • Except when 0 is the conditional mean of Yi , the edf 
based on the Y s may have no direct relevance to the distribution of 0. Even 
when the Y s are relevant (when 0 is the mean parameter), the histogram 
produced from the Yi is overdispersed. It has the appropriate mean, but its 
variance is E[V(Y \ 6)\ + V[E(Y \ 0)} = E[V(Y | 0)] + V[0\, the first term 
produces overdispersion. 

On the other hand, the histogram produced from the PMs is underdis¬ 
persed. To see this, let Vi — V(0i \ Yi) and recall r/i = E(0 \ Yi). Then the 
moments produced by Gk (and approximately those produced by G k ) are 


mean 

variance 


j tdGk(t) = l^2vi = V-. 

J(t - T]) 2 dG k (t) 

Ie v <+Ie^-o) 2 - 


(7.3) 


The sample variance of the PMs produces only the second term, so the 
histogram based on them is underdispersed. To correct for this underdis¬ 
persion, Louis (1984) and Ghosh (1992) derived constrained Bayes (CB) 
estimates. These estimates minimize posterior expected (weighted) SEL 
for individual parameters subject to the constraint that their sample mean 
and variance match those in (7.3). Specifically, CB estimates (oq,.... a k ) 
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minimize 


X! w i E K a i ~ I Y ] = X] W *'( a «. ~ ’?*') 2 + X] UW ’ 

subject to the constraints 

i£«< = jEw 

j E ,,! ' - j E+ jE (w-w 1 - 

The equally-weighted case («;,• = 1) produces the closed-form solution 

2 . 


0$ 


CB 


v + 


1 + 


E 1 


E(% - ??) 2 . 


(»?. - 7)- 


These CB estimates start with the PMs and adjust them to have the vari¬ 
ance in (7.3). Since the term in square brackets is greater than 1. the 
estimates are more spread out around f) than are the PMs. Table 3.3 and 
Figure 3.1 present an example of these estimates. 

Building on the CB approach, Shen and Louis (1998) defined a ~G then 
rank" (GR) approach by assigning the mass points of G i (equation (7.2) 
to coordinates using the ranks R t 


0f R 



(7.4) 


The GR estimates have an edf equal to G k and ranks equal to the R, and so 
are SEL-optimal for these features. Though there is no apparent attention 
to reducing coordinate-specific SEL, assigning the ['s to coordinates by a 
permutation vector z to minimize - iy) 2 produces the same assign¬ 

ments as in (7.4). This equivalence indicates that the GR approach pays 
some attention to estimating the individual Bs and provides a structure for 
generalizing the approach to multivariate coordinate-specific parameters. 

As an added benefit of the GR approach, the 0f R are monotone transform 
- gr „ 

equivariant, i.e., [A(0)] = h(9f R ) for monotone h(-). 


7.1.2 Triple goal estimates 

Since estimates of coordinate-specific parameters are inappropriate for pro¬ 
ducing histograms and ranks, no single set of values can simultaneously 
optimize all three goals. However, in many policy settings, communication 
and credibility is enhanced by reporting a single set of estimates with good 
performance for all three goals. To this end, Shen and Louis (1998) in¬ 
troduce "triple-goal" estimation criteria. Effective estimates are those that 
simultaneously produce a histogram that is a high quality estimate of the 
parameter histogram, induced ranks that are high quality estimates of the 
parameter ranks, and coordinate-specific parameter estimates that perform 
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well. Shen and Louis (1998) evaluate and compare candidate estimates in¬ 
cluding the Y s, the PMs, CB, and GR estimates. Shen and Louis (1998) 
study properties of the various estimates, while Conlon and Louis (1999), 
Louis and Shen (1999), and Shen and Louis (2000) show how they work in 
practice. Here we summarize results. 

As Shen and Louis (1998) show in the exchangeable case, the edf of the 
CB estimates is consistent for G if and only if the marginal distribution of 
ri differs from G only by a location/scale change. The Gaussian/Gaussian 
model with 9 the mean parameter qualifies, but in general CB estimates 
will not produce a consistent edf. However, Gk is consistent for G when¬ 
ever G is identifiable from the marginal distributions (see Lindsay, 1983). 
Consistency is transform invariant. 

PM dominates the other approaches for estimating individual 9s under 
SEL, as it must. However, both GR and CB increase SEL by no more than 
32% in all simulation cases, and both outperform use of the MLE (the Yj 
Comparison of CB and GR is more complicated. As predicted by the Shen 
and Louis (1998) results, for the Gaussian-Gaussian case, CB and GR are 
asymptotically equivalent, while for the gamma-gamma case, model CB 
has a slightly better large-sample performance than does GR. However, for 
finite k, simulations show that GR can perform better than CB for skewed 
or long-tailed prior distributions. 

As it must, GR dominates other approaches for estimating G k and the 
advantage of GR increases with sample size. The Gaussian-Gaussian model 
provides the most favorable setup for CB relative to GR. Yet, the risk for 
CB exceeds GR by at least 39%. Performance for CB is worse in other 
situations, and in the Gaussian-lognormal case CB is even worse than PM. 
Importantly, the GR estimates are transform equivariant, and their perfor¬ 
mance in estimating Gk is transform invariant. 

A start has been made on approaches to multiple-goal estimates for mul¬ 
tivariate, coordinate-specific parameters. Ranks are not available in this 
situation, but coordinate-specific estimates based on the histogram are 
available. First, estimate Gk by a A'-point discrete G k that optimizes some 
histogram loss function. Then assign the mass points of G k to coordinates 
so as to minimize SEL for the individual parameters. In the univariate 
parameter case, these estimates equal the GR estimates. 

Finally, in addition to possible non-robustness to misspecification of the 
sampling likelihood (a potential problem shared by all methods), inferences 
from hierarchical models may not be robust to prior misspecification, es¬ 
pecially if coordinate-specific estimates have relatively high variance. This 
non-robustness is most definitely a threat to the GR estimates, which de¬ 
pend on extreme percentiles of the posterior distribution. In the next sub¬ 
section we discuss "robustifying" Bayes and empirical Bayes procedures. 
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7.1.3 Smoothing and robustness 

Subsection 3.4.3 presented the EM algorithm approach to finding the non- 
parametric maximum likelihood (NPML) estimate of the prior distribution. 
Using it instead of a parametric prior (e.g., a conjugate) has been shown 
to be quite efficient and robust, especially for inferences based on only the 
first two moments of the posterior distribution (Butler and Louis, 1992). 
Since the NPML is always discrete with at most k mass points, smoothing 
is attractive for increased efficiency with small sample sizes. Approaches in¬ 
clude directly smoothing the NPML, adding "pseudo-data" to the dataset 
and then computing the NPML, and model-based methods such as use of 
a Dirichlet process prior for G (see Section 2.5). 

Laird and Louis (1991) propose an alternative, which they call smoothing- 
by-roughening (SBR), based on the EM algorithm method of finding the 
NPML. Shen and Louis (1999) provide additional evaluations. Though ini¬ 
tiating the EM by a discrete prior with k mass points is best for producing 
the NPML, a continuous prior can initiate the EM. In this situation, the 
initiating prior G^ 0) and all subsequent iterates (G (i/| )are continuous. As 
v — > oo, iterates converge very slowly to the NPML. Therefore, stopping 
the EM at iteration i/ k produces an estimate of the prior that is smoothed 
toward G {{>> (equivalently, roughened away from G |0) ) with more smoothing 
for smaller values of i> k . Note that the EM recursion uses the conditional 
expectation that produces Q k , Specifically, 

= E[G k {t) | Y, G 1 "'] . (7.5) 

where G^ v 1 plays the role of the prior. 

As an example of smoothing-by-roughening and its relation to the NPML 
and ensemble estimates, Shen and Louis (1999) analyzed the baseball data 
from Efron and Morris (1973b). Batting averages are available for the first 
45 at-bats of the 1970 season for 18 major league baseball players. The full- 
season batting average plays the role of the true parameter value. Batting 
averages are arc-sine transformed to stabilize their variance, and then these 
are centered around the mean of the transformed data. As a result, 0 is the 
center of the transformed observed batting average data. 

The SBR starting distribution G 1 ^ is Uni f (-2.5.2.5). and is approxi¬ 
mated by 200 equi-probable, equally-spaced mass points. Ligure 7.1 shows 
the cdf for the MMLE normal prior and the NPML. Notice that the NPML 
is concentrated at two points (i.e., it is very "rough"). The estimated nor¬ 
mal prior is both smoothed and constrained to a narrow parametric form. 

Ligure 7.2 displays a sequence of roughened prior estimates along 
with the normal prior. Recall that the NPML is G 100 > The sequence of esti¬ 
mates roughens from its Uni f (-2.5, 2.5) starting point toward the NPML. 
Convergence is extremely slow. 

Linally, Ligure 7.3 displays various choices for estimating G k , including 
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-1 0 1 

X 


Figure 7.1 Estimated parametric and nonparametric priors for the baseball data. 


the actual function (the edf of the transformed season batting averages). 
The remaining estimates (see the figure legend) are 

• MLE: edf of the transformed, observed data 

• PEB: the posterior means based on a MMLE normal prior 

• NPEB: the posterior means based on the NPML prior 

• CEB: the Louis and Ghosh constrained estimates 

• SBR: G^ 20 ) 

Notice that the edf based on the MLE has a wider spread than that com¬ 
puted from the true parameter values. However, the posterior means from 
the PEB and the MPEB are shrunken too far toward the prior mean, and 
produce edfs with too small a spread. The CB approach gets the spread and 
shape about right, as does the SBR estimate from 20 iterations, though it 
produces a continuous distribution. Figure 3.1 presents another comparison 
of edfs based on the data (MLE), PEB and CB. 

The SBR estimate is included in Figure 7.3 to show its alignment with 
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Figure 7.2 Convergence of the smoothing-by-roughening (SBR) estimate from a 
uniform G^ toward the NPML using the EM algorithm. 


the true Gk even though it is not a candidate (since it is continuous and 
the target distribution is discrete). A discretized version similar to G j. (see 
equation 7.2) may be effective, but further evaluation is needed. 

The primary potential of the SBR estimate is as an estimate of G, not Gk- 
Such a smoothed estimate has the potential to produce robust and efficient 
(empirical) Bayes procedures. If (f°' > is chosen to be the prior mean of a 
hyperprior, the SBR approach is similar to use of a Dirichlet process prior. 
Indeed, employing this idea for estimation of the mean in a normal/normal 
two-stage model, setting j/j. = a + h log(Ar) exactly matches the estimate 
based on a parametric hyperprior. The parameters a and b capture the 
relative precision of the hyperprior and the data, and the number of iterates 
is always of order log(&). 

The SBR and Dirichlet process approaches have the potential to produce 
procedures that are robust to prior misspecification and efficient for a family 
of priors. They are especially attractive when the number of sampling units 


© 2000 by CRC Press LLC 





X 


Figure 7.3 Various estimates of Gk for the baseball data (see text for key to figure 
legend). 


k increases, but the information per unit does not (e.g., precision of the 
unit-specific MLEs stays constant). In these situations, if the parametric 
prior family is incorrect, procedures will have poor properties even for large 
k. Good performance for a broad range of loss functions will require that 
the SBR estimate (with i> k iterations) or the posterior distribution from the 
Dirichlet process is consistent for G. The hyperprior Bayesian approach 
is, in principle, more attractive, since the posterior process incorporates 
residual uncertainty. Both approaches require further study. 


7.2 Order restricted inference 

Another area for which Bayes and empirical Bayes methods are particularly 
well suited is that of order restricted statistical inference. By this we mean 
modeling scenarios which feature constraints on either the sample space 
(truncated data problems) or the parameter space (constrained parame- 
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ter problems). Such problems arise in many biostatistical settings, such as 
life table smoothing, bioassay, longitudinal analysis, and response surface 
modeling. While a great deal of frequentist literature exists on such prob¬ 
lems (see for example the textbook by Robertson, Wright, and Dykstra, 
1988), analysis is often difficult or even impossible since the order restric¬ 
tions typically wreak havoc with the usual approaches for estimation and 
testing. By contrast, the Bayesian approach to analyzing such problems is 
virtually unchanged: we simply add the constraints to our Bayesian model 
(likelihood times prior) in the form of indicator functions, and compute the 
posterior distribution of the parameter of interest as usual. But again, this 
conceptual simplicity could not be realized in practice before the advent of 
sampling based methodology. 

As an example, consider the problem of smoothing a series of raw mor¬ 
tality rates y, into a gradually increasing final sequence 0;, suitable for 
publication as a reference table. This process is referred to as "gradua¬ 
tion" in the actuarial science literature. A restricted parameter space arises 
since the unknown 0 = 0 2 , • •., d k ) T must lie in the set S = {0 : 

0\ < 0-2 <■■■ < 9k}. Under the prior distribution tt( 0|A), we have that 
p(y, 0|A) = /(y |0)7 t(0|A)/s( 0) where /s(0) is the indicator function of the 
set S, so that Is(6) equals 1 when 0 E S. and equals 0 otherwise. Gelfand, 
Smith, and Lee (1992) point out that the complete conditional distribution 
for, say, 0, satisfies 


p{9i\y, A, Oj^i) oc /(y|0)7r(0|A)/s, (#,-) , (7.6) 

where S’; is the appropriate cross-section of the constraint set S. In our 
increasing parameter case, for example, we have .5, = {#, : 0;_i < 9 t < 
0} + 1 }. Hence if tt is chosen to be conjugate with the likelihood/ - so that 
the unconstrained complete conditional distribution emerges as a familiar 
standard form, then the constrained version given in (7.6) is simply the 
same standard distribution restricted to 5,-. Carlin (1992) and Carlin and 
Klugman (1993) use this technology to obtain increasing and increasing 
convex graduations of raw mortality data for males over age 30. These 
papers employ vague prior distributions in order to maintain objectivity in 
the results and keep the prior elicitation burden to a minimum. 

Finally, Geyer and Thompson (1992) show how Monte Carlo methods 
may be used to advantage in computing maximum likelihood estimates 
in exponential families where closed forms for the likelihood function are 
unavailable. Gelfand and Carlin (1993) apply these ideas to arbitrary con¬ 
strained and missing data models, where integration and sampling is over 
the data space, instead of the parameter space. These ideas could be ap¬ 
plied to the problem of empirical Bayes point estimation in nonconjugate 
scenarios, where again the likelihood contains one or more intractable in¬ 
tegrals. 
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7.3 Nonlinear models 

As discussed in Chapter 3, the term "hierarchical model" refers to the 
extension of the usual Bayesian structure given in equation (2.1) to a hier¬ 
archy of 1 levels, where the joint distribution of the data and the parameters 
is given by 

/(y|0l)7Tl(0l|02)7T2(02|03) ' ' • 7T/(»/ |A), 
and we seek the marginal posterior of the first stage parameter p{6\ |y). 
In the case where / and the are normal (Gaussian) distributions with 
known variance matrices, this marginal posterior is readily available us¬ 
ing the results of Lindley and Smith (1972). However, more complicated 
settings generally require Monte Carlo methods for their solution. Gelfand 
et al. (1990) develop such techniques for arbitrary hierarchical normal lin¬ 
ear models using the Gibbs sampler. Several biostatistic al applications of 
such models have already appeared in the literature. Zeger and Karim 
(1991) apply the technology to a random effects model, and observe that 
high correlation between model parameters has a deleterious effect on algo¬ 
rithm convergence. Lange, Carlin, and Gelfand (1992) employ a hierarchical 
model for the longitudinal CD4 T-cell numbers of a cohort of San Francisco 
men infected with the HIV virus (see Sections 7.4 and 8.1 for an in-depth 
discussion of longitudinal data models). Their model allows for incomplete 
and unbalanced data, population covariates, heterogeneous variances, and 
errors-in-variables. 

In many applications, a linear model is inappropriate for the underlying 
mean structure. Our reasons for rejecting the linear model may be theoreti¬ 
cal (e.g., an underlying physical process) or practical (e.g., poor model fit). 
Fortunately, the Bayesian framework requires no major alterations to ac¬ 
commodate nonlinear mean structures. That is, our analytic framework is 
unchanged; only our computational algorithms must become more sophis¬ 
ticated (typically via Monte Carlo methods). The field of pharmacokinetics 
offers a large class of such models, since the underlying process of how a 
drug is processed by the body is known to follow a complex pattern that 
is nonlinear in the parameters (see e.g. Wakefield, 1996). 

As another example, consider the problem of bioassay, i.e., the determina¬ 
tion of an appropriate dose-response relationship from observed data. The 
example we shall use is modeling data from the Ames salmonella/microsome 
assay (Ames et al., 1973), perhaps the most widely used short term test for 
determining the mutagenic potency of certain chemical agents on strains 
of salmonella. The addition of a mutagenic agent to a plate of cultured mi¬ 
crobes results in the growth of mutant colonies, clearly visible to the naked 
eye; the number of such colonies is an indicator of the mutagenic potential 
of the agent. 

Krewski et al. (1993) review several biologically based models for Ames 
assay data. For a given agent-strain combination they suggest the following 
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model for the expected number of mutant colonies p as a function of dose: 

p(D\0o,0 l ,0 2 ,9) = (A> + 0iD)exp{-0 2 D 6 }, (7.7) 

where 9 > 1. The first factor reflects linear mutagenicity, while the second 
represents exponential toxicity. With mutagenic agents we typically have 0\ 
and 0 2 positive, so that the tendency to mutate increases with dose, but so 
does the danger of cell death. The expected number of mutant colonies p ( D ) 
generally increases with dose up to a certain point, after which it decreases, 
reflecting the effect of these two competing forces. Clearly 0q > 0 since it 
is the expected number of mutant colonies at dose zero. The constraint 
9 > 1 ensures that toxic effects are nonlinear and, in fact, negligible at low 
doses, so that the slope of the dose-response curve at D = 0, termed the 
mutagenic potency, is equal to 0i . 

For a particular agent-strain combination, let X, k j denote the observed 
number of mutant colonies observed at dose i (i = 1, . . ., /) in the A-th 
laboratory (k = 1, . . ., K ) on replication j (j = 1, ...,/). A parametric 
Bayesian approach requires specification of the distribution of these ob¬ 
served counts. Krewski et al. (1993) point out that the observed counts 
may be expected to follow a Poisson distribution with some extra-Poisson 
variation. Rather than specify the form of this distribution, they assume 
that the variance of the observed counts V at any dose satisfies V = p v , 
where p is the expected number of mutant colonies at the same dose. Es¬ 
timates of p and 6 are obtained from the data by regression techniques. 
Quasi-likelihood estimation (McCullagh and Nelder, 1989, Chapter 9) is 
subsequently employed to estimate the 0 t while keeping p and 9 fixed at 
their estimated values. 

We take account of extra-Poisson variability by allowing the Poisson 
parameter to vary randomly among replications and/or laboratories. A 
possible three-stage Bayesian hierarchical model is 

X ikj Po(\ ik ), 

where X ik = p{Di\0o, 0\ k , 0 2 ,6)\ (7-8) 

log(/?ifc) ~ N(p, t 2 ), 

where Po denotes the Poisson distribution and N denotes the normal 
(Gaussian) distribution. Normality of the log mutagenic potencies across 
laboratories was suggested by Myers et al. (1987). Prior distributions on 
9,0o, and 0 2 add to the model's second stage, while the priors on ij and 
r 2 form the third stage, completing the hierarchy. Although mutagenic po¬ 
tency is the quantity of primary interest, the models for 0q and 0 2 should 
perhaps be similar to that for 0\. As such, model (7.8) might be modified 
to incorporate random /?oj and 0 2k effects with little increase in conceptual 
and computational complexity. 

Model (7.8) allows explicit estimation of inter-laboratory variability in 
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mutagenic potency. But it does not permit estimation of intea-laboratory 
variability, and thus its comparison with inter-laboratory variability. In 
order to estimate these variance components, we can use a four-stage hier- 


archical model of the form 



Xikj 

ind 

rsj 

P 0 (^ikj), 

where A ik j 

= 

K D i-,0o,0ikj,02,Q); 

log(A kj) 

ind 

N(vk,r 2 ); 

Vk 

iid 

N(n,<r 2 ) . 


(7.9) 


In this case the intra-laboratory variability is quantified by r 2 , while the 
inter-laboratory variability is quantified by cr 2 . Concerns over the possible 
non-normality of the distributions of the r)k are mitigated by the fact that 
distributional changes at the third and higher stages of a hierarchy typically 
have limited impact on posterior distributions of second stage parameters 
(see Berger, 1985, p. 108). Identifying prior distributions on 0, 0q, 02, fi, 
t 2 , and <r" completes the Bayesian model specification. 

A principal purpose of many bioassays is to determine an agent's suit¬ 
ability as a reference standard. However, all of the models presented thus 
far are capable of handling only a single chemical, rather than both a new 
agent and a reference agent. When experimental observations on a known 
standard agent are available, we can extend model (7.8) to 


X%] Po(\ { £), where 

Ak = n(Di-,0( u) ,0[ u k \0( u) ,OW); 

log(^ } ) ~ V(7/"' 1 . (r 1 “ 1 ) 2 ) v 

and 

^ik] where 

log(/?^) A? N(rf‘\(rW) 2 ), 

where a ( u ) superscript denotes the unknown agent while an (,v) denotes 
the standard agent. Then the posterior distribution of j k = 0^/0^ sum¬ 
marizes the relative mutagenic potency of the unknown agent to that of the 
standard for lab k, k = 1, . . K. If all K of these posterior distributions 
are similar, one can conclude that the unknown agent is a useful reference 
standard. 

In a preliminary analysis of a single International Program on Chemi¬ 
cal Safety (IPCS) agent-strain combination (urban air particles and strain 
TA100-S9), Etzioni and Carlin (1993) assume independence among the 
four prior components and adopt familiar distributional forms, such as a 
normal density for M and gamma densities for the positive parameters 
V,0q , and 02 . They then obtain posterior means and standard deviations 
f° r 0o, 02, M, and V under the three-stage model (7.8). While this model 
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requires only elementary quadrature methods, it is too basic to answer all 
of the authors' research questions - most notably the intra- versus inter-lab 
variability question. Still, their Bayesian analysis contrasts markedly with 
the method of Krewski et al. (1993), who addressed these two estimation 
goals for the IPCS data in two separate steps. In the latter approach, for a 
given agent-strain combination a separate estimate of mutagenic potency 
was obtained for each repetition in each laboratory. Then analysis of vari¬ 
ance was used to decompose the variance of the potency estimates into 
within and between laboratory components. 

The more complicated nonlinear hierarchical models described above re¬ 
quire careful application of Monte Carlo methods. For example, in model 
(7.9) it is not possible to write the loglikelihood as a sum of low dimen¬ 
sional integrals. More explicitly, here we would define /3 = {/?i kj}, a vector 
of length KJ. If we could obtain closed forms for the integrals over the 
{Vk}, we would still have that kj and 0\kj' are marginally dependent 
given p. and <r 2 . Hence the likelihood would now at best be a product of 
/-dimensional integrals. Of course, in the case of hierarchical normal linear 
models, such integration is still feasible largely in closed form, thanks to 
the well-known work of Lindley and Smith (1972). In our heavily noncon¬ 
jugate scenario, however, no such convenient forms are available and, due 
to the model's high dimensionality, we would likely turn to Monte Carlo 
integration methods. Here, the nonconjugate structure present at the first 
two stages of model (7.9) precludes closed forms for the complete condi¬ 
tional distributions of the fiikj ■ Generation of these values would have to 
proceed via some sort of Metropolis-Hastings or other rejection algorithm 
(e.g., Gilks and Wild, 1992). 


7.4 Longitudinal data models 

A great many datasets arising in statistical and biostatistical practice con¬ 
sist of repeated measurements (usually ordered in time) on a collection 
of individuals. Data of this type are referred to as longitudinal data, and 
require special methods for handling the correlations that are typically 
present among the observations on a given individual. More precisely, sup¬ 
pose we let Yij denote the measurement on the i th individual in the 
study, j = 1, . . ., s, and i = 1, . . . , n. Arranging each individual's collection 
of observations in a vector Yi = (Yu ,..., Y lSt ) T , we can fit the usual fixed 
effects model 

)'i — Xj'O -f a , 

where X, is an s, x p design matrix of covariates and a is a p x 1 parameter 
vector. While the usual assumptions of normality, zero mean, and common 
variance might still be appropriate for the components of generally they 
are correlated with covariance matrix £. This matrix can be general or 
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follow some simplifying pattern, such as compound symmetry or a first- 
order autoregression. Another difficulty is that data frequently exhibit more 
variability than can be adequately explained using a simple fixed effects 
model. 

A popular alternative is provided by the mixed model 

Y i = X i a + Vr i p i + ei. l (7.10) 

where W, is an Si x q design matrix (q typically less than p), and 3 t is 
a q x 1 vector of subject-specific random effects, usually assumed to be 
normally distributed with mean vector 0 and covariance matrix V. The 
fas capture any subject-specific mean effects, and also enable the model to 
accurately reflect any extra variability in the data. Furthermore, it may now 
be realistic to assume that, given i3 t , the components of t. t are independent. 
This allows us to set 27 = a 2 1 5> (as in the original paper by Laird and 
Ware, 1982), although an autoregressive or other structure for £ may still 
be appropriate. Note that the marginal covariance matrix for Y t includes 
terms induced both by the random effects (from V) and the likelihood 
(from 27). 

The model (7.10) can accommodate the case where not every individ¬ 
ual has the same number of observations (unequal Sj), perhaps due to 
missed office visits, data transmission errors, or study withdrawal. In fact, 
features far broader than those suggested above are possible in principle. 
For example, heterogeneous error variances a 2 may be employed, removing 
the assumption of a common variance shared by all subjects. The Gaus¬ 
sian errors assumption may be dropped in favor of other symmetric but 
heavier-tailed densities, such as the Student's t. Covariates whose values 
change and growth curves that are nonlinear over time are also possible 
within this framework. 

An increasing number of commercially available computer packages (e.g., 
SAS Proc Mixed and BMDP5V)enable likelihood analysis for mixed mod¬ 
els; see Appendix Section C.2 for a discussion. Though these packages 
employ sophisticated numerical methods to estimate the fixed effects and 
covariance matrices, in making inferences they are unable to include the 
variance inflation and other effects of having estimated the features of 
the prior distributions. By contrast, MCMC methods are ideally suited 
to this goal, since they are able to perform the necessary integrations in 
our high-dimensional (and typically analytically intractable) hierarchical 
model. Several papers (e.g., Zeger and Karim, 1991; Lange et al., 1992; 
Gilks, Wang et al., 1993; McNeil and Gore, 1996) have employed MCMC 
methods in longitudinal modeling settings. Chib and Carlin (1999) provide 
a method for generating the fixed effects a and all the random effects /?, ' n 
a single block, which greatly improves the convergence of the algorithm. 

We require prior distributions for a, a 2 , and V to complete our model 
specification. Bayesians often think of the distribution on the random ef- 
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fects /?j as being part of the prior (the "first stage," or structural portion), 
with the distribution on V forming another part (the "second stage," or 
subjective portion). Together, these two pieces are sometimes referred to 
as a hierarchical prior. Again, the case study in Section 8.1 provides a 
real-data illustration. 


7.5 Continuous and categorical time series 


Many datasets arising in statistics and biostatistics exhibit autocorrelation 
over time that cannot be ignored in their analysis. A useful model in such 
situations is the state space model, given in its most general form by 

Xt = ft(xt-i) + U t , and (7 11) 

Vt = h t (x t ) + v t , 


where xt is the (unobserved) p x 1 state vector, y t is the (observed) q x 1 
observation vector, and _/)(■) and /*<(•) are known, possibly nonlinear func¬ 
tions which depend on some unknown parameters. Typically, and v t are 
taken as independent and identically distributed, with u t ~ N p { 0,£) and 
Vt ~ N q ( 0, T), where N p denotes the /^-dimensional normal distribution. 

State space models are particularly amenable to a Bayesian approach, 
since the time-ordered arrival of the data means that the notion of up¬ 
dating prior knowledge in the presence of new data arises quite naturally. 
Carlin, Poison, and Stoffer (1992) use Monte Carlo methods in an analysis 
of these models that does not resort to convenient normality or linearity 
assumptions at the expense of model adequacy. Writing the collection of 
unknown parameters simply as 6 for the moment, the likelihood for model 
(7.11) is given by 


n n 

L{d\x o, x, y) = p(x 0 ) g u {x t |x t _i, 6) JJ g v (y t \x t , 6) (7.12) 

t =l t=i 

for some densities g u and g v . Next, introducing the nuisance parameters 
A = (Ai,..., A n ) and u> = (uq,..., w n ) they assume that 

p(x t \x t -i, A t ,E) = N p (ft(x t -i), A t E), and 
p(y t \xt,u>t,T) = N g {h t (x t ),u>tT), 

where £ and T are parameter matrices of sizes p x p and q x q, respectively. 
Hence unconditionally. 


9u{xt\x t -\,T) = f A p(x t \x t - U A*,£)pi(A ( )dA t , and 

g v (y t \x t ,T) = J a p(y t \x t ,uj t) T)p2(io t )duj t , t=l,...,n. 

Note that, by varying pi(Aj) and p 2 (u>t), the distributions g u and g v are 
scale mixtures of multivariate normals for each t, thus enabling a wide 
variety of nonnormal error densities to emerge in (7.14). For example, in 
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the univariate case, the distributions g u and g v can be double exponential, 
logistic, exponential power, or t densities. 

As discussed just prior to Example 6.2, the conditioning feature of the 
Gibbs sampler makes such normal scale mixtures very convenient com¬ 
putationally. Non-Gaussian likelihoods may be paired with non-Gaussian 
priors and Gaussian complete conditionals may still emerge for the state 
parameters (in the present case, the x t s). Further, under conjugate priors 
the complete conditionals for E and T follow from standard Gaussian and 
Wishart distribution theory, again due to the conditioning on A and u>. 
Finally, the complete conditionals for the nuisance parameters themselves 
are of known functional form. For example, for our state space model we 
have p(A t | Aj,u>, E, T, y,x) = p(A ( |E, x t , * t -i) <x p(x t \x t -i, A t , E)pi(A t ) 
by Bayes' Theorem. But the normalizing constant is known by construc¬ 
tion: it is g u (xt i, E) from equation (7.14). Complete conditionals for 
the LJt would of course arise similarly. 

Thus, the sampling based methodology allows the solution of several 
previously intractable modeling problems, including the modeling of asym¬ 
metric densities on the positive real line (as might be appropriate for 
death rates), explicit incorporation of covariates (such as age and sex), 
heteroscedasticity of errors over time, multivariate analysis (including si¬ 
multaneous modeling of both preliminary and final mortality estimates), 
and formal model choice criteria for selecting the best model from many. 
West (1992) uses importance sampling and adaptive kernel-type density 
estimation for estimation and forecasting in dynamic nonlinear state space 
models. In the special case of linear state space models, Carter and Kohn 
(1994) provide an algorithm for generating all of the elements in the state 
vector simultaneously, obtaining a far less autocorrelated sampling chain 
(and hence faster convergence) than would be possible using the fully con¬ 
ditional algorithm of Carlin, Poison, and Stoffer (1992). See also Carter and 
Kohn (1996) for further improvements in the context of conditionally Gaus¬ 
sian state space models. Chib and Greenberg (1995b) extend this frame¬ 
work to allow time invariant parameters in the state evolution equation, as 
well as multivariate observations in a vector autoregressive or moving aver¬ 
age process of order 1. Marriott et al. (1996) use a hybrid Gibbs-Metropolis 
algorithm to analyze the exact ARMA likelihood (rather than that condi¬ 
tional on initial conditions), and incorporate the associated constraints on 
the model parameters to insure stationarity and invertibility of the process. 

Carlin and Poison (1992) extend model (7.11) further, formulating it for 
categorical data by adding the condition y t — A(y*) to the specification, 
where A(-) is a given many-to-one function. They also provide an example 
of biostatistical interest, applying the methodology to the infant sleep state 
study described by Stoffer et al. (1988). In this dataset, the experimenter 
observes only y t = 1 or 0, corresponding to REM or non-REM sleep, respec¬ 
tively, and the goal is to estimate the infant's underlying (continuous) sleep 
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state Xt . Simple density estimates for the x t are available both prior to the 
observation of y t (prediction) and afterward (filtering). Albert and Chib 
(1993b) use a similar approach to handle underlying autoregressive struc¬ 
ture of any order, and use the resulting hidden Markov model to advantage 
in analyzing U.S. GNP data. Chib (1996) provides a multivariate updat¬ 
ing algorithm for a collection of discrete state variables, analogous to the 
result of Carter and Kohn (1994) for continuous variables in a state space 
model. See also Chib and Greenberg (1998) for MCMC Bayesian methods 
for analyzing multivariate probit models, even more general approaches for 
correlated binary response data. 

7.6 Survival analysis and frailty models 

Bayesian methods have been a subject of increasing interest among re¬ 
searchers engaged in the analysis of survival data. For example, in mon¬ 
itoring the results of a clinical trial, the Bayesian approach is attractive 
from the standpoint of allowing the input of additional relevant informa¬ 
tion concerning stopping or continuing the trial. Spiegelhalter et al. (2000) 
provide a remarkably comprehensive overview of and guide to the literature 
on Bayesian methods in clinical trials and other areas of health technology 
assessment (e.g., observational studies). Areas covered include prior deter¬ 
mination, design, monitoring, subset analysis, and multicenter analysis; a 
guide to relevant websites and software is also provided. 

However, the Bayesian approach may be somewhat limited in its ability 
to persuade a broad audience (whose beliefs may or may not be charac¬ 
terized by the selected prior distribution) as to the validity of its results. 
In an attempt to remedy this, Chaloner et al. (1993) developed software 
for eliciting prior beliefs and subsequently demonstrate its application to 
a collection of Prelevant parties (medical professionals, policy makers, 
patients, etc.) in an attempt to robustify the procedure against outlying 
prior opinions. Modern computational methods allow real-time evaluation 
of the trial's stopping rule, despite the P-fold increase in the computational 
burden. 


7.6.1 Statistical models 

We now outline the essential features of a survival model. Suppose that 
associated with individual i in our study, i = 1,.... n. is a ^-dimensional 
covariate vector z, which does not change over time. We model the effect 
of these covariates using the familiar log-linear proportional hazards form, 
namely h(t |z) = h< } (t) exp(/3 r z), where /3 is a p-vector of unknown param¬ 
eters. Suppose that d failures are observed in the combined group over the 
course of the study. After listing the individuals in order of increasing time 
to failure or censoring, let z,• be the covariate vector associated with the i th 
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individual in this list, i = 1, . . n. Denote the j th risk set, or the collection 
of individuals still alive at the time of the j th death, by 7 Zj for j = 1, . . ., d. 
Then the Cox partial likelihood (Cox and Oakes, 1984, Chapter 7) for this 
model is given by 


m = n 

j~ 1 



(7.15) 


where (3 = (/?j,..., f3 p ) and zj = ( z\j ,.. -jZpj). it is often convenient to 
treat this partial likelihood as a likelihood for the purpose of computing 
the posterior density. A justification for this is offered by Kalbfleisch (1978), 
who places a gamma process prior on the baseline hazard function (inde¬ 
pendent of the prior for (3 ). and shows that the marginal posterior density 
for {3 approaches a form proportional to the Cox partial likelihood as the 
baseline hazard prior becomes arbitrarily diffuse. 

To complete the Bayesian model we must specify a joint prior distribution 
tt(/3) on the components of (3. Before doing so, however, we focus attention 
more clearly on the monitoring of treatment-control clinical trials. Suppose 
the variable of primary interest is the first component of z, Zi a binary 
variable indicating whether individual i has been randomly assigned to the 
control group (zi = 0) or the treatment group (.z i = 1). Hence interest 
focuses on the parameter 3\ and in particular on whether (3\ =0. Now 
returning to the question of prior elicitation for f3, we make the simplifying 
assumption that 7 r(/3) = tti(/? i)?r_ 1 (/3_ 1 ), where (3_ 1 = (/? 2 jv • •, /? P ). That 
is, f3i is a priori independent of the covariate-effect parameters f3_ 1 . In the 
absence of compelling evidence to the contrary, this a priori lack of inter¬ 
action between the treatment effect and covariates provides a reasonable 
initial model. Moreover, this prior specification does not force indepen¬ 
dence in the posterior, where the data determine the appropriate level of 
dependence. In addition, we assume a flat prior on the nuisance parame¬ 
ters, namely 7t_i(/3_ 1 ) = 1. This eliminates specifying the exact shape of 
the prior, and produces an evaluation of covariate effects comparable to 
that produced by traditional frequentist methods. Of course, if objective 
evidence on covariate effects is available, it can be used. However, since 
"standard practice" in patient care changes rapidly over time, such a prior 
should be fairly vague or include a between-study variance component. 


7.6.2 Treatment effect prior determination 

It remains to specify the (univariate) prior distribution on the treatment 
effect, 7Ti(/?i )• An algorithm and corresponding software for eliciting such a 
prior from an expert was developed by Chaloner et al. (1993); see Appendix 
Section C.l for details. While this method has been an operational boon to 
the Bayesian approach to this problem, its use of subjective expert opinion 
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as the sole source of evidence in creating the prior distribution on j3\ may 
constitute an unnecessary risk in many trials. Objective development of 
a prior for the control-group failure rate may be possible using existing 
data, since this value depends only on the endpoint selected, and not the 
particular treatment under investigation. Even so, eligibility, diagnostic and 
other temporal drifts limit the precision of such a prior. Data relevant for 
the treatment group are less likely to exist, but may be available in the 
form of previous randomized trials of the treatment. Failing this, existing 
non-randomized trial data might be employed, though prior uncertainty 
should be further inflated here to account for the inevitable biases in such 
data. 

A sensible overall strategy for prior determination in clinical trials is 
recommended by Spiegelhalter, Freedman, and Parmar (1994). These au¬ 
thors suggest investigating and reporting the posterior distributions that 
arise over a range of priors, in the spirit of sensitivity analysis (see Sub¬ 
section 6.1.1). For example, we might investigate the impact of a "clinical" 
prior, which represents the (typically optimistic) prior feelings of the trial's 
investigators, a "skeptical" prior, which reflects the opinion of a person or 
regulatory agency that doubts the treatment's effectiveness, and a "non- 
inf ormative" prior, a neutral position that leads to posterior summaries 
formally equivalent to those produced by standard maximum likelihood 
techniques. The specific role of the prior and the uncertainty remaining in 
the results after the data have been observed can then be directly ascer¬ 
tained. Appendix Subsection C.3.1 contains a brief description of iBART, a 
software program for implementing this approach with a normal or other 
survival likelihood for the treatment effect parameter. 


7.6.3 Computation and advanced models 

For the above models and corresponding priors, we require efficient algo¬ 
rithms for computing posterior summaries (means, variances, density es¬ 
timates, etc.) of the quantities of interest to the members of the clinical 
trial's data monitoring board. Naturally 0\ itself would be an example of 
such a quantity; the board might also be interested in the Bayes factor 
BF in favor of H (] : 3\ ^ 0, or perhaps simply the posterior probabil¬ 
ity that the treatment is efficacious, given that this has been suitably de¬ 
fined. For simplicity of presentation, in what follows we concentrate solely 
on the posterior of f3\. The simplest method for summarizing /3\ ’s poste¬ 
rior would be to use the first order (normal) approximation provided by 
Theorem 5.1. Such an approximation is extremely attractive in that it is 
computationally easy to produce quantiles of the posterior distribution, as 
well as a standardized Z-score for testing Hq : fti = p\°' 1 (for example, 
Z = (E(p\\TZ) — p[ 0) )/\/V(pi |7?.)). This statistic allows sequential graph- 
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ical monitoring displays similar to traditional frequentist monitoring plots 
(e.g. Lan and DeMets, 1983). 

If n is not sufficiently large, results appealing to asymptotic posterior 
normality will be inappropriate, and we must turn to more accurate meth¬ 
ods such as Laplace's method or traditional numerical quadrature, such 
as the Gauss-Hermite rules. In fact, if the dimension of /3 is small, it will 
be very convenient to simply discretize the entire vector (instead of just 
0i) and use a naive summation method, such as the trapezoidal rule. For 
high-dimensional (3, Monte Carlo integration methods will likely provide 
the only feasible alternative. 

These methods are especially convenient for an extension of model (7.15) 
known as a frailty model. Introduced by Vaupel, Manton, and Stallard 
(1979), this is a mixed model with random effects (called frailties) that 
correspond to an individual's health status. That is, we replace the Cox 
hazard function for subject i, h(t\zi,{3) = ho(t) exp(/3 T Zj), by 

h{t\zi,($, %•) = h 0 (t) 7 i exp(f3 T Zi) , 

where the frailties y i are typically assumed to be i.i.d. variates from a spec¬ 
ified distribution (e.g., gamma or lognormal). Clayton (1991) and Sinha 
(1993) show how the Gibbs sampler may be used to fit this model under 
an independent increment gamma process parametrization of the baseline 
hazard function. Gustafson (1994) and Sargent (1995) avoid this assump¬ 
tion by treating the Cox partial likelihood as a likelihood, and employing 
more sophisticated MCMC approaches. 

A nonnormal posterior distribution for precludes traditional monitor¬ 
ing plots based on standardized Z-scores. However, Bayesian monitoring 
may still proceed by plotting the posterior probabilities that 3 1 falls outside 
an "indifference zone" (or "range of equivalence"), u)- Recall from 

Subsection 6.1.2 that this approach, proposed by Freedman and Spiegelhal- 
ter (1983, 1989, 1992), assumes that for values within this prespecified 
range, we are indifferent as to the practical use of either the treatment or 
the control preparation. Since negative values of f3\ indicate an effective 
treatment, we can set ^ j =K < 0 and u = 0, an additional benefit 
perhaps being required of the treatment in order to justify its higher cost 
in terms of resources, clinical effort, or toxicity. The Bayesian decision rule 
terminates the trial (or more realistically, provides a guideline for termi¬ 
nation) when P(/3i > dj y \'JZ) is sufficiently small (deciding in favor of the 
treatment), or when P(/?i < di ,l j7?,] is sufficiently small (deciding in favor 
of the control). The case study in Section 8.2 provides an illustration. 

7.7 Sequential analysis 

As illustrated in equation (2.4), when data samples y x , y 2 ,..., y k arrive 
over time, Bayes' Rule may be used sequentially simply by treating the 
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current posterior as the prior for the next update, i.e., 

p(%i>- • -,Yk) oc fK{yK\0,yi, ■ ..,y K -i)p(0\yi,. ..,y K -i) ■ 

This formula is conceptually straightforward and simple to implement pro¬ 
vided all that is required is an updated posterior for 9 at each monitor¬ 
ing point. In many cases, however, the analyst faces a specific decision at 
each timepoint - say, a repeated hypothesis test regarding 0, the outcome 
of which may determine whether the experiment should be continued or 
not. For the frequentist, this raises the thorny issues of repeated signifi¬ 
cance testing and multiple comparisons. To protect the overall error rate 
a, many classical approaches resort to "a-spending functions", as in the 
aforementioned work by Lan and DeMets (1983). 

In the Bayesian framework, the problem is typically handled by com¬ 
bining the formal decision theoretic tools of Appendix B with the usual 
Bayesian probability tools. The traditional method of attack for such prob¬ 
lems is backward induction (see e.g. DeGroot, 1970, Chapter 12). a method 
which, while conceptually simple, is often difficult to implement when K 
is at all large. Subsection 7.7.1 describes a class of two-sided sequential 
decision problems (i.e., their likelihood, prior, and loss structures), while 
Subsection 7.7.2 outlines their solution via the traditional backward induc¬ 
tion approach. Then in Subsection 7.7.3, we present a forward sampling 
algorithm which can produce the optimal stopping boundaries for a broad 
subclass of these problems at far less computational and analytical ex¬ 
pense. In addition, this forward sampling method can always be used to 
find the best boundaries within a certain class of decision rules, even when 
backward induction is infeasible. We then offer a brief discussion, but defer 
exemplification of the methods until Subsection 8.2.4, as part of our second 
Chapter 8 case study. 

7.7. 1 Model and loss structure 

Carlin, Kadane, and Gelfand (1998) present the sequential decision problem 
in the context of clinical trial monitoring. Since this setting is very common 
in practice, and is one we have already begun to develop in Sections 6.1.2 
and 7.6, we adopt it here as well. 

Suppose then that a clinical trial has been established to estimate the 
effect 0 of a certain treatment. Information already available concerning 0 
is summarized by a prior distribution p(9). As patient accrual and followup 
is getting underway, the trial's data safety and monitoring board (hereafter 
referred to simply as "the board") agrees to meet on K prespecified dates 
to discuss the trial's progress and determine whether or not it should be 
continued. At each monitoring point k £ {1, .. . , If), some cost will have 
been invested in obtaining a new datapoint yj, having distribution f(yk \0)- 
enabling computation of the updated posterior distribution p(0\y\,..., yu). 
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It is convenient to think of k = 0 as an additional, preliminary monitoring 
point, allowing the board to determine whether the trial should be run at 
all, given only the prior and the set of costs {Ak} ■ 

Suppose that for the treatment effect parameter 0 £ SJJ t large negative val¬ 
ues suggest superiority of the treatment, while large positive values suggest 
superiority of the placebo (inferiority of the treatment). Suppose further 
that in conjunction with substantive information and clinical expert opin¬ 
ion, an indifference zone ( 0l,9jj ) can be determined such that the treat¬ 
ment is preferred for 0 < 0£ the placebo for 9 > $y and either decision is 
acceptable for 0 L < 0 < 9t;. 

At the final monitoring point ( K ), there are then two decisions available 
to the board: d\ (stop and decide in favor of the treatment), or d 2 (stop 
and decide in favor of the placebo). Corresponding to these two decisions, 
the indifference zone (0 L) 0 V ) suggests the following loss functions for given 
true values of 9 £ JJ: 

l[ K \d u 9) = s[ K) (0-0 u )+ , and l { 2 K \d 2 ,Q) = sf ) (0 L - 0) + , (7.16) 

where ,s\ h ' ] . > 0 and the "+" superscript denotes the "positive part" 

function, i.e., y+ = max(0, y ). That is, incorrect decisions are penalized, 
with the penalty increasing linearly as 0 moves further from the indifference 
zone. To facilitate the presentation of backward induction later in this sec¬ 
tion, for the moment we simplify the loss functions (7.16) by dropping the 
positive part aspect, resulting in linearly increasing negative losses (gains) 
for correct decisions. 

Given the posterior distribution of the treatment parameter at this fi¬ 
nal time point, p(9\iik), the best decision is that corresponding to the 
smaller of the two posterior expected losses, ! ~ E[l[ K \di,0)\yx] = 
s\ K) [E(0\y K ) - e v ], and = E[? 2 K) (d 2 ,0)\y K ] = s ( 2 K) [0 L - E(0\y K )]. 
If E(6\y^) exists, then clearly L\ K ' increases in E(0\yK) while de¬ 
creases in E{6\yn). The breakpoint between the optimality of the two 
decisions follows from equating these two expected losses. Specifically, de¬ 
cision di (decide in favor of the treatment) will be optimal if and only 
if 


E(0\y K ) < 7k = 


s[ K) 0u + s^’Ol 

s\ K) + s^ 


(K), 


(7.17) 


otherwise (when E{9\yK) > 7k), decision d 2 is optimal. Thus when s\ h 1 = 
^ 1 the breakpoint j K is simply the midpoint of the indifference zone. A 
similar result holds under the positive part loss functions in (7.16). 
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7.7.2 Backward induction 


To obtain the optimal Bayesian decision in the sequential setting at any 
time point other than the last, one must consider the probability of data 
values that have not yet been observed. This requires "working the prob¬ 
lem backward" (from time K back to time 0), or backward induction. To 
illustrate, consider time K - 1. There are now three available decisions: 
d\. d 2 , and d 3 (continue and decide after y K is available for analysis). To 
simplify the notation, suppose for now that K = 1, so that no data have 
yet accumulated and hence the current posterior p(9\yx-i) equals the prior 
p(0). The posterior expected losses are now — 9u], 

- sf~ l \0 L - E(8)], and 

L ’? K 1} = A k + J r( k) s[ K ’(9 - 8u)p(0\y K )m(y K )dOdy K 

+ /#(*> I -00 4 K) {8l ~ 8)p{8\y K )m(y K )dedy K , 

(7.18) 

where r[ k) = {y K ■ E(6\y K ) < 7 ^}, R ( 2 K) = {y K : E(0\y K ) > j K }, y K 
is as given in (7.17), and m{yx) is the marginal density of the (as yet 
unobserved) data value y K Once the prior p(0) and the sampling cost A K 
are chosen, the optimal decision is that corresponding to the minimum of 
1 E L^ K 11 and L‘.k 11 Alternatively, a class of priors (say, the normal 
distributions having fixed variance <r 2 ) might be chosen, and the resulting 
L\ ; plotted as functions of the prior mean //.. The intersection points 
of these three curves would determine stopping regions as a function of p, 
reminiscent of those found in traditional sequential monitoring displays. 

While the integration involved in computing is a bit awkward, it 

is not particularly difficult or high-dimensional. Rather, the primary hurdle 
arising from the backward induction process is the explosion in bookkeeping 
complexity arising from the resulting alternating sequence of integrations 
and minimizations. The number of Monte Carlo generations required to 
implement a backward induction solution also increases exponentially in 
the number of backward steps. 


7.7.3 Forward sampling 

The backward induction method's geometric explosion in both organiza¬ 
tional and computational complexity as K increases motivates a search 
for an alternative approach. We now describe a forward sampling method 
which permits finding the optimal sequential decision in a prespecified class 
of such decisions. We implement this method when the class consists of de¬ 
cision rules of the following form: at each step k, 

• for E(9\yi ,..., y^) < jk,L’ ma k e decision d\ (stop and choose the treat¬ 
ment); 
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• for E(6\y \,..., y^) > jk,u, make decision d 2 (stop and choose placebo); 

• for jic,L < E(0\yi ,..., yk) < 7 *^, make decision d;j (continue sampling). 
Carlin, Kadane, and Gelfand (1998) provide sufficient conditions to ensure 
that the optimal strategy is of this form; customary normal mean, binomial, 
and Poisson models meet these conditions. 

Note that 7 * 1 < Jk,u, with equality always holding for k = K (i.e., 
7 k,l = 7 k,U = Ik)- This special structure motivates the following forward 
sampling algorithm, which for comparison with Section 7.7.2 is illustrated 
in the case K = 2. Suppose an independent sample (0*, y* K _ l j, y* K ,), j = 
1, . . . , G is drawn via composition from the joint density p(0,yK-i, yx) — 
p(0)f{yK-\\Q)f{yK |$). When K = 2, decision rules can be indexed by the 
2K-1 = 5-dimensional vector 7 = {lK-2,L , JK-2.U, 7K-1.U, 7K )'■ 

Suppose that at the i th iteration of the algorithm, a current estimate 7 (l) 
of the optimal rule of this form is available. For each of the simulated 
parameter-data values, the loss incurred under the current rule can be 
computed as follows: 


If E(9)<4l 2tL , 
else if E{9) > 7k- 2 , 1 ;» 
else if E{9\y* K _ l j ) < 7k > _ 1 ,l. 
else if E(0\y* K _ 1J ) > 7 ^ ) _i, t/ , 
else if E(0\y* K _ l j ,y* K j ) < 7 ^, 


Lj = s{ K ~ 2 \0* - &u) (7.19) 

L j = s [ K ~ 2 \e L -e;) 

Lj=s[ K - l \0*-9 u) + A k ^ 1 
Lj = 4 JSf ~ 1) (^ -Oj) + A K -i 
Lj - s\ K \0* - 6u) + Ak-i + Ak 
L j — s { 2 h \$i — Oj) + Ak- 1 + Ak 


Then L = ^ Ylf=i Lj is a Monte Carlo estimate of the posterior expected 
loss under decision rule K If the current estimate is not the minimum 
value, it is adjusted to 7 (* +1 ) and the above process is repeated. At con¬ 
vergence, the algorithm provides the desired optimal stopping boundaries. 

Note that many of the y* K . values (and some of the y* K _ v values as well) 
will not actually figure in the determination of L, since they contribute only 
to the later conditions in the decision tree (7.19). These cases correspond 
to simulated trials that stopped before these later data values were actually 
observed; ignoring them in a given replication creates the proper marginal 
distribution of the parameter and the observed data (if any). Indeed, note 
that when E(0) £ (jk- 2 ,l, Jk- 2 ,u] all of the sampled values fall into one 
of the first two cases in (7.19). Second, L is not likely to be smooth enough 
as a function of 7 !’) to allow its minimization via standard methods, such 
as a quasi-Newton method. Better results may be obtained by applying 
a more robust, univariate method (such as a golden section search) se¬ 
quentially in each coordinate direction, i.e., within an iterated conditional 
modes algorithm (see Besag, 1986). 
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The main advantage of this approach over backward induction is clear: 
since the length of 7 is 2k - 1 , the computational burden in forward sam¬ 
pling grows linearly in K, instead of exponentially. The associated book¬ 
keeping and programming is also far simpler. Finally, the forward sampling 
algorithm is easily adapted to more advanced loss structures. For exam¬ 
ple, switching back to the positive part loss functions (7.16). a change that 
would cause substantial complications in the backward induction solution, 
leads only to a corresponding change in the scoring rule (7.19). 

A potential disadvantage of the forward sampling algorithm is that it re¬ 
places a series of simple minimizations over discrete finite spaces (a choice 
among 3 alternatives) with a single, more difficult minimization over a con¬ 
tinuous space One might well wonder whether this minimization 

is any more feasible for large K than backward induction. Fortunately, even 
when the exact minimum is difficult to find, it should be possible to find 
a rule that is nearly optimal via grid search over 7 -space, meaning that 
approximate solutions will be available via forward sampling even when 
backward induction is infeasible. 

It is critical to get a good estimate of each of the components of the 
optimizing 7 and not just the "components of interest"(typically, the 
early components of 7 near the current time point), for otherwise there is 
a risk of finding a local rather than a global minimum. As such, one may 
want to sample not from the joint distribution p(0, t/i, • • •, Uk) but from 
some modification of it designed to allow more realizations of the latter 
stages of the decision tree (7.19). To correct for this, one must then make 
the appropriate importance sampling adjustment to the Lj scores in 
By deliberately straying from the true joint distribution, one is sacrificing 
overall precision in the estimated expected loss in order to improve the 
estimate of the later elements of 7 . As such, the phrase "unimportance 
sampling" might be used to refer to this process. 

Discussion 

In the case of a one-parameter exponential family likelihood, forward sam¬ 
pling may be used to find the optimal rule based solely on the posterior 
mean. In a very general parametric case, the optimal sequential strategy 
could be a complex function of the data, leading to a very large class of 
strategies to be searched. But what of useful intermediate cases - say, a 
normal likelihood with both the mean and the variance unknown? Here, 
the class of possibly optimal sequential rules includes arbitrary functions 
of the sufficient statistic vector (y, s 2 ). In such a case, for the particular 
loss function in question, it may be possible to reduce the class of possi¬ 
ble optimal strategies. But for practical purposes it may not be necessary 
to find the optimal sequential strategy. Forward sampling can be used to 
evaluate general sequential strategies of any specified form. If the class of 
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sequential strategies offered to the algorithm is reasonably broad, the best 
strategy in that class is likely to be quite good, even if it is not optimal in 
the class of all sequential strategies. This constitutes a distinct advantage 
of the forward sampling algorithm. When backward induction becomes in¬ 
feasible, as it quickly does even in simple cases, it offers no advice about 
what sequential strategy to employ. Forward sampling, however, can be 
applied in such a situation with a restricted class of sequential strategies 
to find a good practical one even if it is not the best possible. 

7.8 Spatial and spatio-temporal models 

Statisticians are increasingly faced with the task of analyzing data that 
are geographically referenced, and often presented in the form of maps. 
Echoing a theme already common in this book, this increase has been 
driven by advances in computing; in this case, the advent of geographic 
information systems (GISs), sophisticated computer software packages that 
allow the simultaneous graphical display and summary of spatial data. But 
while GISs such as ARC/INFO, ArcView, and Maplnfo are extraordinarily 
powerful for data manipulation and display, as of the present writing, they 
have little capacity for statistical modeling and inference. 

To establish a framework for spatial modeling, we follow that given in 
the enormously popular textbook on the subject by Cressie (1993). Let 
s E be a location in J-dimensional space, and X(s) be a (possibly vector¬ 
valued) data value observed at s. The full dataset can then be modeled as 
the multivariate random process 

{X(s) : s G D} , (7.20) 

where s varies over an index set D C Cressie (1993) then uses (7.20) 
to categorize spatial data into three cases. 

• geostatistical data, where X(s) is a random vector at a location s which 
varies continuously over D, a fixed subset of that contains a d- 
dimensional rectangle of positive volume; 

• lattice data, where D is again a fixed subset (of regular or irregular 
shape), but now containing only countably many sites in :, ft d normally 
supplemented by neighbor information; 

• point pattern data, where now D is itself random; its index set gives the 
locations of random events that are the spatial point pattern. X(s) itself 
can simply equal 1 for all s £ D (indicating occurrence of the event), or 
possibly give some additional covariate information (producing a marked 
point pattern process ). 

An example of the first case might be a collection of measured oil or water 
reserves at various fixed source points s, with the primary goal being to 
predict the reserve available at some unobserved target location t. In the 
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second case, the observations might correspond to pixels on a computer 
screen, each of which has precisely four neighbors (up, down, left, and 
right), or perhaps average values of some variable over a geographic region 
(e.g., a county), each of whose neighbor sets contains those other counties 
that share a common border. Finally, point pattern data often arise as loca¬ 
tions of certain kinds of trees in a forest, or residences of persons suffering 
from a particular disease (which in this case might be supplemented by age 
or other covariate information, producing a marked point pattern). 

In the remainder of this section, we provide guidance on the Bayesian 
fitting of models for the first two cases above. While mining applications 
provided the impetus for much of the model development in spatial statis¬ 
tics, we find the term "geostatistical" somewhat misleading, since these 
methods apply much more broadly. Similarly, we dislike the term "lattice," 
since it connotes data corresponding to "corners" of a checkerboard-like 
grid. As such, we prefer to relabel these two cases more generically, accord¬ 
ing to the nature of the set D. Thus in Subsection 7.8.1 we consider models 
for geostatistical data, or what we call point source data models, where the 
locations are fixed points in space that may occur anywhere in D. Then in 
Subsection 7.8.2 we turn to lattice data models, or what we term regional 
summary data models, where the source data are counts or averages over 
geographical regions. 

In the S-plus environment familiar to many statisticians, the maps li¬ 
brary has for some time allowed display of point source data and re¬ 
gional summary data at the U.S. state or county level; more recently the 
S+SpatialStats add-on now enables easy access to many traditional geo¬ 
statistical, lattice, and point pattern analysis tools (see Kaluzny et al., 
1998). Our treatment is not nearly so comprehensive, but does suggest 
how Bayesian methods can pay dividends in a few of the most common 
spatial analysis settings. 

7.8.1 Point source data models 

We begin with some general notation and terminology. Let X(si) denote 
the spatial process associated with a particular variable (say, environmen¬ 
tal quality), observed at locations {s; : s, G D C A basic starting 

point is to assume that X(s 8 ) is a second-order stationary process, i.e., 
that i?[X(s)] = p and Cov[X(si), X(sj)] = C(s; — sj) < oo. C is called 
the covatiogram, and is analogous to the autocovariance function in time 
series analysis. It then follows that l/ar[A"(s,) — X(s ; )] = 27 (s^ — s j) is a 
function of the separation vector s,; — Sj alone, and is called the variogram; 
•y(s — sj) itself is called the semivariogram. The variogram is said to be 
isotropic if it is a function solely of the Euclidean distance between s, and 
Sj, dij ; in this case 2y(sj — Sj) = 2y (dij), so that C(s{ — s j) = C(dq). 

We define the sill of the variogram as lim^oo 27 (d), and the nugget as 
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lim^-fO 27 (d). The latter need not be zero (i.e., the variogram may have a 
discontinuity at the origin) due to microscale variability or measurement 
error (Cressie, 1993, p.59). If CXd) —► 0 as d—* 00 , then 27 (d) —»■ 2C(0), the 
sill. For a monotonic variogram that reaches its sill exactly, the distance at 
which this sill is reached is called the range; observations further apart than 
the range are uncorrelated. If the sill is reached only asymptotically, the 
distance such that Corr[A'(s,),X(Sj)] = 0.05 is called the effective range. 

If we further assume a second-order stationary Gaussian process, our 
model can be written as 

X(s,)=/i+^(sO + e(s,) , (7.21) 

where p is again the grand mean, W(s) accounts for spatial correlation, 
and e(s) captures measurement error, W(s) and e(s) independent and nor¬ 
mally distributed. To specify an isotropic spatial model, we might use a 
parametric covariance function, such as Cov(W{&f), fC(sj)) = a 2 p(dij; <f>). 
For example, we might choose p(dij\<f>) = exp(—^d !; ), <j> > 0, so that the 
covariance decreases with distance at constant exponential rate (j> ■ The mea¬ 
surement error component is then often specified as e(s) ~ N(0, r 2 ). From 
this model specification we may derive the variogram as 

2j(r 2 ,a 2 ,<p) = 2[r 2 + a 2 (l-p(d ij -,f))] , (7.22) 

so that t 2 is the nugget and r 2 + <r 2 is the sill. 

Despite the occasionally complicated notation and terminology, equa¬ 
tions (7.21) and (7.22) clarify that the preceding derivation determines 
only a four-parameter model. Bayesian model fitting thus proceeds once 
we specify the prior distribution for 6 — (p. r 2 , <7 2 , 0) standard choices 
for these four components might be flat, inverse gamma, inverse gamma, 
and gamma, respectively (the latter since there is no hope for a conjugate 
specification for <p). Since this model is quite low-dimensional, any of the 
computational methods described in Chapter 5 might be appropriate, in 
particular, one could easily obtain draws { 8^ g = 1 ,... G } fromp(0|X). 

Bayesian kriging 

While marginal posterior distributions such as p(<i>\x) might be of substan¬ 
tial interest, historically the primary goal of such models has been to pre¬ 
dict the unobserved value X(t) at some target location t. The traditional 
minimum mean squared error approach to this spatial prediction prob¬ 
lem is called kriging, a rather curious name owing to the influential early 
work of South African mining engineer D.G. Krige (see e.g. Krige, 1951). 
Handcock and Stein (1993) and Diggle et al. (1998) present the Bayesian 
implementation of kriging, which is most naturally based on the predictive 
distribution of the targets given the sources. More specifically, suppose we 
seek predictions at target locations (tj,. .., t m ) given (spatially correlated) 
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response data at source locations (sj,..., s n ). Following equations (7.21) 
and (7.22) above, the joint distribution of Xi = (X(si),..., A'(s n ))' and 
X 2 ■ (A'(t •.) A'(t,„))' is 

[(XuX2)'\ll,T 2 ,<T 2 ,<j>\~ MVNm+n (fjlm+n, ( ^ ^)), (7.23) 

where = C'ot)(Xj ; , X(), k,\ = 1,2 based on our isotropic model (e.g., 
the//th element of Sn is just < 7 2 exp(—<pdij) where d t] is the separation 
distance between s; and Sj and similarly for the other blocks). Thus using 
standard multivariate normal distribution theory (see e.g. Guttman, 1982, 
pp.70-72), the conditional distribution of the target values X 2 given the 
source values Xi and the model parameters is 

[X 2 | r 2 , a 2 ,<f>} ~ E 2 .i) . 

where 

A* 2.1 = m. + S 2 iS 11 1 (Xi — pin), and E 2 1 = E 22 — E 2 iE 11 1 Ei 2 . 

Thus, marginalizing over the parameters, the predictive distribution we 
seek is given by 

p(X 2 |X, ) = /p(X 2 |Xi, /i, t 2 , a 2 , <f>)p{p, t 2 , cr 2 , 0\X.i)dfidT 2 da 2 d(j) 

1 G 

« ^5>( X 2 | x i, (r 2 )f«J, (o- 2 )^ 1 . o'" 1 ) , (7.24) 

.9=1 

where (p^ , (r 2 f :l \ 0 <g ^) are the posterior draws obtained above. 

Point and interval estimates for the spatially interpolated values X (tj) 
then arise from the appropriate marginal summaries of (7.24). Recall from 
Chapter 5 that we need not actually do the mixing in (7.24): merely draw¬ 
ing X^ values by composition from p(X 2 |Xi, p( g \ (r 2 )^ g \ (<r 2 )^, <f>^) is 
sufficient to produce a histogram estimate of any component of p(X 2 |Xj). 

A nisotropy 

If the spatial correlation between A^(s«) and X(Sj ) depends not only on the 
length of the separation vector h, i; - = s* — Sj , but on its direction as well, 
the process X (s) is said to be anisotropic. A commonly modeled form of 
anisotropy is geometric range anisotropy, where we assume there exists a 
positive definite matrix B such that the variogram takes the form 

2r(h ij) = 2 7 ((h' J Bh !; ) 1 / 2 ) . 

Taking B— (f> 2 I produces (h' : -Bh^) 1 / 2 — ipd t] hence isotropy. Thus 
anisotropy replaces <j> by the parameters in B ; equation (7.22) generalizes 
to 

2y(r 2 ,a 2 , B) = 2[r 2 + <r 2 (l - /> ((hJ j Bh ij ) 1/2 ))] , (7.25) 
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so that r 2 and r 2 + er 2 are again the nugget and sill, respectively. 

In the most common spatial setting, Hi 2 , the 2x2 symmetric matrix B = 

( j 11 2 ) completely defines the geometric anisotropy, and produces 

\P 12 P22 / 

elliptical (instead of spherical) correlation contours in the plane having 
orientation u> defined implicitly by the relation 


cot(2w) - 


fill ~ P 22 
2/?12 


Also, the effective range in the direction 77 , r Tj . i s given by 


c 

rn = (h> v Bh v y/i ’ 

where h r/ = (cos 77 , sin 77 ), the unit vector in the direction of r/ and c is a 
constant such that 77 (c) = 0.05. Thus for our simple exponential p function 
above, we have c = - log(0.05)ft; 3. 

Ecker and Gelfand (1999) again recommend a simple product of in¬ 
dependent priors for the components of 8 = (//, r 2 , (T 2 , B), suggesting a 
Wishart(R, v) prior for B with small degrees of freedom 1 /. Again, while 
the notation may look a bit frightening, the result in Jr -2 is still only a 
6 -parameter model (up from 4 in the case of isotropy). Estimation and 
Bayesian kriging may now proceed as before; see Ecker and Gelfand (1999) 
for additional guidance on pre-analytic exploratory plots, model fitting, and 
a real-data illustration. See also Handcock and Wallis (1994) for a spatio- 
temporal extension of Bayesian kriging, an application involving temper¬ 
ature shifts over time, and related discussion. Le, Sun, and Zidek (1997) 
consider Bayesian multivariate spatial interpolation in cases where certain 
data are missing by design. 


7.8.2 Regional summary data models 

In this subsection we generally follow the outline of Waller et al. (1997), 
and consider spatial data (say, on disease incidence) that are available as 
summary counts or rates for a defined region, such as a county, district, or 
census tract. Within a region, suppose that counts or rates are observed 
for subgroups of the population defined by socio-demographic variables 
such as gender, race, age, and ethnicity. Suppose further that for each 
subgroup within each region, counts or rates are collected within regular 
time intervals, e.g. annually, as in our example. Hence in general notation, 
an observation can be denoted by y iit . Here i = 1, ..., / indexes the regions, 
1=1, ..., L indexes the subgroups, and 7 = 1, ..., T indexes the time periods. 
In application, subgroups are defined through factor levels, so that if we 
used, say, three factors, subscript l would be replaced by jkm. 

We assume ym is a count arising from a probability model with a base- 
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line rate and a region-, group-, and year-specific relative risk iput. Based on 
a likelihood model for the vector of observed counts y given the vector of 
relative risks t/>, and a prior model on the space of possible ips, MCMC com¬ 
putational algorithms can estimate a posterior for ip given y, which is used 
to create a disease map. Typically, a map is constructed by color-coding 
the means, medians, or other summaries of the posterior distributions of 
the rj, M . 

Typically, the objective of the prior specification is to stabilize rate esti¬ 
mates by smoothing the crude map, though environmental equity concerns 
imply interest in explaining the rates. The crude map arises from the like¬ 
lihood model alone using estimates (usually MLEs) of the xput based only 
on the respective y ilt . Such crude maps often feature large outlying rela¬ 
tive risks in sparsely populated regions, so that the map may be visually 
dominated by rates having very high uncertainty. Equally important, the 
maximum likelihood estimates fail to take advantage of the similarity of 
relative risks in adjacent or nearby regions. Hence, an appropriate prior 
for ip will incorporate exchangeability and/or spatial assumptions which 
will enable the customary Bayesian "borrowing of strength" to achieve the 
desired smoothing. 

The likelihood model assumes that, given ip, the counts ym are con¬ 
ditionally independent Poisson variables. The Poisson acts as an approxi¬ 
mation to a binomial distribution, say, ym Bin(n itt ,pu t ) where n at is 
the known number of individuals at risk in county i within subgroup £ at 
time t, and pm is the associated disease rate. Often the counts are suffi¬ 
ciently large so that ym or perhaps yjym (or the Freeman-Tukey transform, 
s/Vut + \Zvm + 1) can be assumed to approximately follow a normal den¬ 
sity. In our case, however, the partitioning into subgroups results in many 
small values of y ilt (including several Os), so we confine ourselves to the 
Poisson model, i.e. y M ~ Po(n M pm). 

We reparametrize to relative risk by writing Emi>ut = nutPut, where 
Em is the expected count in region i for subgroup l at time t. Thus our 
basic model takes the form 

yut ~ Po(Eutiput) ■ 

In particular, if p* is an overall disease rate, then rpm = pm/p* and Em = 
map* . The ps are said to be externally standardized ifp* is obtained from 
another data source, such as a standard reference table (see e.g. Bernar- 
dinelli and Montomoli, 1992). The ips are said to be internally standardized 
if p* is obtained from the given dataset, e.g., if p* = J2m Vitt/ TLm 
In our data example we lack an appropriate standard reference table, and 
therefore rely on the latter approach. Under external standardizing a prod¬ 
uct Poisson likelihood arises, while under internal standardizing the joint 
distribution of the ym is multinomial. However, since likelihood inference 
is unaffected by whether or not we condition on Vat ( see e S- Agresti, 


© 2000 by CRC Press LLC 



1990, p. 455-56), it is common practice to retain the product Poisson like¬ 
lihood. 

The interesting aspect in the modeling involves the prior specification 
for iput, or equivalently for p t j t = log ipuf Models for pm can incorporate 
a variety of main effect and interaction terms. Consider the main effect 
for subgroup membership. In our example, with two levels for sex and 
two levels for race, it takes the form of a sex effect plus a race effect plus 
a sex-race interaction effect. In general, we can write this main effect as 
St - xJ/3 for an appropriate design vector y , and parameter vector f3. 
In the absence of prior information we would take a vague, possibly flat, 
prior for (3. In many applications it would be sensible to assume additivity 
and write pm — £i + tit • For instance, in our example this form expresses 
the belief that sex and race effects would not be affected by region and 
year. This is the only form we have investigated thus far though more 
complicated structures could, of course, be considered. Hence, it remains 
to elaborate . 

Temporal modeling 

As part of e it , consider a main effect for time S t . While we might specify S t 
as a parametric function (e.g., linear or quadratic), we prefer a qualitative 
form that lets the data reveal any presence of trend via the set of posterior 
densities for the 6 t . For instance, a plot of the posterior means or medians 
against time would be informative. But what sort of prior shall we assign 
to the set {^}? A simple choice used in Section 8.3 is again a flat prior. 
An alternative is an autoregressive prior with say AR(1) structure, that is, 
~ N (y t + p(6t- 1 - 7 t-i) , (T 2 s ). We might take y t = yt, whence 7 
and <t| are hyperparameters. 

Spatial modeling 

The main effect for regions, say rji, offers many possibilities. For instance, 
we might have regional covariates collected in a vector zwhich contribute 
a component h(zi). Typically h would be a specified parametric function 
possibly linear, i.e., zj u>. In any event, a flat prior for a> would 
likely be assumed. In addition to or in the absence of covariate information 
we might include regional random effects 0, to capture heterogeneity among 
the regions (Clayton and Bernardinelli, 1992; Bernardinelli and Montomoli, 
1992). If so, since i arbitrarily indexes the regions, an exchangeable prior 
for the 6i s seems appropriate, i.e., 

Oi ~ N (k 9 , . (7.26) 

We then add a flat hyperprior for Kg , but require a proper hyperprior (typi- 
tally a gamma distribution) for r. Formally, with an improper prior such as 
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r (d- 3)/2 d= 0, 1,2, the resultant joint posterior need not be proper (Hobert 
and Casella, 1996). Pragmatically, an informative prior for t insures well 
behaved Markov chain Monte Carlo model fitting; see Subsection 8.3.3 for 
further discussion. 

Viewed as geographic locations, the regions naturally suggest the pos¬ 
sibility of spatial modeling (Clayton and Kaldor, 1987; Cressie and Chan, 
1989; Besag, York and Mollie 1991; Clayton and Bernardinelli, 1992; Ber- 
nardinelli and Montomoli, 1992). As with exchangeability, such spatial 
modeling introduces association across regions. Of course we should con¬ 
sider whether, after adjusting for covariate effects, it is appropriate for geo¬ 
graphic proximity to be a factor in the correlation among observed counts. 

If so, we denote the spatial effect for region i by 6, . In the spirit of Clay¬ 
ton and Kaldor (1987) and Cressie and Chan (1989) who extend Besag 
(1974), we model the fa using a conditional autoregressive (CAR) model. 
That is, we assume the conditional density of <j>i\<j>j^i is proportional to 
exp , where w 2i > 0 is a weight reflecting the 

influence of <f>j on the expectation of <p t , and a, > 0 is a " sample size 
associated with region i. From Besag (1974) we may show that the joint 
density of the vector of spatial effects 4> is proportional to exp B<j>) 
where Bn = a; and Bij = —cniVij, It is thus a proper multivariate normal 
density with mean 0 and covariance matrix AC 1 provided B is symmetric 
and positive definite. Symmetry requires a,; w t j — ajWji. A proper gamma 
prior is typically assumed for A. 

Two special cases have been discussed in the literature. One, exemplified 
in Cressie and Chan (1989), Devine and Louis (1994), and Devine, Halloran 
and Louis (1994), requires a matrix of interregion distances dij. Then Wij 
is set equal to g(dij) for a suitable decreasing function g. For example, 
Cressie and Chan (1989) choose g based on the estimated variogram of 
the observations. Since tty, = Wji, symmetry of B requires a,- constant, in 
fact, a,; = g(0), which without loss of generality we can set equal to 1. A 
second approach, as in e.g. Besag, York, and Mollie (1991), defines a set d, 
of neighbors of region i. Such neighbors can be defined as regions adjacent 
(contiguous) to region i, or perhaps as regions within a prescribed distance 
of region i. Let n, be the number of neighbors of region i and let uy, = 1/Vy 
if j g Q u and 0 otherwise. Then, if a t — ry. B is symmetric. It is easy to 
establish that 



where 0j = nj 1 g d ,<f>j . It is also clear that B is singular since the sum of 
all the rows or of all the columns is 0. Thus, the joint density is improper. 
In fact, it is straightforward to show that <f> T B(j> can be written as a sum 
of pairwise differences of neighbors. Hence, the joint density is invariant 
to translation of the <^s, again demonstrating its impropriety. The notions 
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of interregion distance and neighbor sets can be combined as in Cressie 
and Chan (1989) by setting Wij = g(djj) for j (= di. In the exercises below 
and our Section 8.3 case study, however, we confine ourselves to pairwise 
difference specification based upon adjacency, since this requires no inputs 
other than the regional map, and refer to such specifications as CAR( A) 
priors. 

Typically the impropriety in the CAR prior is resolved by adding the 
sum-to-zero constraint Ya=i fa = 0, which is awkward theoretically but 
simple to implement "on the fly" as part of an MCMC algorithm. That is, 
we simply subtract the current mean y 4^^ f rom all of the o 3> at the 

end of each iteration g. Again, adding a proper hyperprior for A, proper 
posteriors for each of the o, ensue. The reader is referred to Eberly and 
Carlin (2000) for further insight into the relationship among identifiability, 
Bayesian learning, and MCMC convergence rates for spatial models using 
CAR( A priors. 

Finally, note that we have chosen to model the spatial correlation in the 
prior specification using, in Besag's (1974) terminology, an auto-Gaussian 
model. One could envision introducing spatial correlation into the like¬ 
lihood, i.e., directly amongst the y,j t , resulting in Besag's auto-Poisson 
structure. Flowever, Cressie (1993) points out that this model does not 
yield a Poisson structure marginally. Moreover, Cressie and Chan (1989) 
observe the intractability of this model working merely with y, t s; for a set 
of yu t s, model fitting becomes infeasible. 

A general form for the regional main effects is thus = h(z ,; u>) + 
0{ + <j)i. This form obviously yields a likelihood in which 9, and <p i are 
not identifiable. However, the prior specification allows the possibility of 
separating these effects; the data can inform about each of them. A related 
issue is posterior integrability. One simple condition is that any effect in 
Hilt which is free of i and has a flat prior must have a fixed baseline level, 
such as 0. For example, Hut cannot include an intercept. Equivalently, if 
both <j>i and 0 t are in the model then we must set = 0. Also, if S t terms 
are included we must include a constraint (say, = 0 or S t — 0), for 
otherwise we can add a constant to each and subtract it from each level 
of the foregoing effect without affecting the likelihood times prior, hence 
the posterior. It is also important to note that Q i and <p, may be viewed 
as surrogates for unobserved regional covariates. That is, with additional 
regional explanatory variables, they might not be needed in the model. 
From a different perspective, for models employing the general form for ^ ) 
there may be strong collinearity between Zj and say which again makes 
the (pi difficult to identify, hence the models difficult to fit. 
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Spatio-temporal interaction 

To this point, pm consists of the main effects £/, 6 t , and 77 ,-. We now turn 
to interactions, though as mentioned above the only ones we consider are 
spatio-temporal. In general, defining a space-time component is thorny be¬ 
cause it is not clear how to reconcile the different scales. Bernardinelli, 
Clayton et al. (1995) assume the multiplicative form (uf + vf)t for the 
heterogeneity by time and clustering by time effects. We again prefer a 
qualitative form and propose a nested definition, wherein heterogeneity 
effects and spatial effects are nested within time. In this way we can ex¬ 
amine the evolution of spatial and heterogeneity patterns over time. We 
write these effects as 0^ and <j>f\ That is, s\ 1 ' is a random effect for 
the i th region in year t. Conditional exchangeability given t would lead 
us to the prior 6^ N(k^\ 1/t<). Similarly, <j>^ is viewed as a spa¬ 

tial effect for the i tl1 region in year t. Again we adopt a CAR prior but 
now given t as well. Hence, we assume has density proportional 

to exp j^— • Again, proper gamma priors are as¬ 

sumed for the r t and the A<. Such definitions for and <j>^ preclude the 
inclusion of main effects 9i and <p l in the model. Also, if <p\ t] appears in 
the model, then we cannot include S t terms that lack an informative prior, 
since for any fixed t the likelihood could not identify both. Similarly, if both 
9^ and <f> appear in the model, we must set = 0. 

In summary, the most general model we envision for pu t takes the form 

PM = X-JP + zjV + + (f)^ . (7.28) 

In practice, (7.28) might well include more effects than we need. In our 
Section 8.3 case study, we investigate more parsimonious reduced models. 

7.9 Exercises 

1. Let Gk (t) be defined as in Subsection 7.1.1. Show that the posterior 
mode of Gk(t), denoted Gk{t), has equal mass at exactly k mass points 
that are derived from an expression which for each t is the likelihood of 
independent, non-identically distributed Bernoulli variables. 

2. Let G* k be a discrete distribution with equal mass at exactly k mass 
points, to be used as an estimate of Gk- Prove that if the Yj are i.i.d., 
then, under integrated squared-error loss and subject to the elf of the 
ensemble producing G\, the best estimates of the individual 0j are the 
mass points themselves, with order assigned as for the observed Yj. Also, 
show that the optimal ranks for the 9j are produced by the ranked Yj . 

3. Compute and compare the mean and variance induced by edfs formed 
from the coordinate-specific MLEs, standard Bayes estimates, and func- 
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tion Gk for the normal/normal. Poisson/gamma, and beta/binomial two- 
stage models. 

4. (Devroye, 1986, p.38) Suppose X is a random variable having cdf F, and 
Y is a truncated version of this random variable with support restricted 
to the interval [a, b]. Then Y has cdf 

( 0 , y < a 

G (y ) = { 7$) -F(a) - a<y<b 

[ l, y>b 

Show that Y can be generated as F~ 1 (F(a ) + U[F(b) — T’(a)]), where 
U is a Unif( 0, 1) random variate. (This result enables "one-for-one" 
generation from truncated distributions for which we can compute F 
and F~ l either exactly or numerically.) 

5. Consider the estimation of human mortality rates between ages x and 
x + k, where x > 30. Data available from a mortality study of a group 
of independent lives includes d r the number of deaths observed in the 
unit age interval [x + i — 1, x + i], and a, the number of person-years 
the lives were under observation ( exposed ) in this interval. Thinking of 
the e t as constants and under the simplifying assumption of a constant 
force of mortality 9{ over age interval i, one can show that 

~ PoieA), 

An estimate of Q i is thus provided by r, = di /e*, the unrestricted maxi¬ 
mum likelihood estimate of 6i , more commonly known as the raw mor¬ 
tality rate. We wish to produce a graduated (smoothed) sequence of 0 {s 
which conform to the increasing condition 

8 g S INC = {8: 0 < 6»i < • • • < 0 k < B} 
for some fixed positive B. 

Assume that the 9 t s are an i.i.d. sample from a G(a,f3) distribution 
before imposing the above constraints, where a is a known constant and 
[3 has an IG(a, b ) hyperprior distribution. Find the full conditionals for 
the 9i and !3 needed to implement the Gibbs sampler in this problem. 
Also, describe how the result in the previous problem would be used in 
performing the required sampling. 

6. Table 7.1 gives a dataset of male mortality experience originally pre¬ 
sented and analyzed by Broffitt (1988). The rates are for one-year inter¬ 
vals, ages 35 to 64 inclusive (i.e., bracket i corresponds to age 34 + i at 
last birthday). 

(a) Use the results of the previous problem to obtain a sequence of gradu¬ 
ated, strictly increasing mortality rate estimates. Set the upper bound 
B = .025 and use the data-based (empirical Bayes) hyperparameter 
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i 

di 

e* 

n 

i 

d% 

€i 

fi 

i 

3 

1771.5 

0.0016935 

16 

4 

1516.0 

0.0026385 

2 

1 

2126.5 

0.0004703 

17 

7 

1371.5 

0.0051039 

3 

3 

2743.5 

0.0010935 

18 

4 

1343.0 

0.0029784 

4 

2 

2766.0 

0.0007231 

19 

4 

1304.0 

0.0030675 

5 

2 

2463.0 

0.0008120 

20 

11 

1232.5 

0.0089249 

6 

4 

2368.0 

0.0016892 

21 

11 

1204.5 

0.0091324 

7 

4 

2310.0 

0.0017316 

22 

13 

1113.5 

0.0116749 

8 

7 

2306.5 

0.0030349 

23 

12 

1048.0 

0.0114504 

9 

5 

2059.5 

0.0024278 

24 

12 

1155.0 

0.0103896 

10 

2 

1917.0 

0.0010433 

25 

19 

1018.5 

0.0186549 

11 

8 

1931.0 

0.0041429 

26 

12 

945.0 

0.0126984 

12 

13 

1746.5 

0.0074435 

27 

16 

853.0 

0.0187573 

13 

8 

1580.0 

0.0050633 

28 

12 

750.0 

0.0160000 

14 

2 

1580.0 

0.0012658 

29 

6 

693.0 

0.0086580 

15 

7 

1467.5 

0.0047700 

30 

10 

594.0 

0.0168350 



Table 7.1 

Raw mortality data (Broffitt, 1988 ) 

values 

a 

— d = r 

2 /(4 - r£? s 

:i e « 

7 *), 

a = 3.0, 

and b — d/(2r), 

where 

r - 

- V* r 

~ / — 1 ’l 

/k and si = 

r,L 

i(r t - 

- r) 2 /(k - 

- 1). These values 


of a and b correspond to a rather vague hyperprior having mean and 
standard deviation both equal to 0 = f /a. (In case you've forgotten, 
a and (3 are the method of moments EB estimates derived in Chapter 
3 , Problem 9.) 

(b) Do the graduated rates you obtained form a sequence that is not 
only increasing, but convex as well? What modifications to the model 
would guarantee such an outcome? 

7. Warning: In this problem, we use the BUGS parametrizations of the nor¬ 
mal and gamma distributions: namely, a N(p, r) has precision (not vari¬ 
ance) t, and a G(a, 0) has mean a//3 and variance a/0 2 . 

Consider the data in Table 7.2, also available on the web in WinBUGS 
format at http: //www. biostat. umn. edu/-brad/data—html Originally 
presented by Wakefield et al. (1994), these data record the plasma con¬ 
centration Yij of the drug Cadralazine at various time lags Xij following 
the administration of a single dose of 30 mg in 10 cardiac failure patients. 
Here, i = 1, .... 10 indexes the patient, while j = 1, . . . , n,- indexes the 
observations, 5 < rii < 8 . These authors suggest a "one-compartment" 
nonlinear pharmacokinetic model wherein the mean plasma concentra- 
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no. of hours following drug administration 


patient 

2 

4 

6 

8 

10 

24 

28 

32 

1 

1.09 

0.75 

0.53 

0.34 

0.23 

0.02 

- 

- 

2 

2.03 

1.28 

1.20 

1.02 

0.83 

0.28 

- 

- 

3 

1.44 

1.30 

0.95 

0.68 

0.52 

0.06 

- 

- 

4 

1.55 

0.96 

0.80 

0.62 

0.46 

0.08 

- 


5 

1.35 

0.78 

0.50 

0.33 

0.18 

0.02 

- 

- 

6 

1.08 

0.59 

0.37 

0.23 

0.17 

- 

- 

- 

7 

1.32 

0.74 

0.46 

0.28 

0.27 

0.03 

0.02 

- 

8 

1.63 

1.01 

0.73 

0.55 

0.41 

0.01 

0.06 

0.02 

9 

1.26 

0.73 

0.40 

0.30 

0.21 

- 

- 

- 

10 

1.30 

0.70 

0.40 

0.25 

0.14 

- 

- 

- 


Table 7.2 Cadralazine concentration data (Wakefield et ah, 1994) 

tion rjij(xij) at time x tJ is given by 

Vij(*i}) = 30a- 1 exp(—• 

Subsequent unpublished work by these same authors suggests this model 
is best fit on the log scale. That is, we suppose 

Zij - log Yij = log rnj(xij) + dj , 

where A ? (0. r,). The mean structure for the Z t] 's thus emerges as 

log Vij(Xij) = log [30QT 1 exp(—/%*ij/ori)] 

= log 30 - log a,- - Pi Xij /a; 

= log 30 - a; - exp (6; - afxij , 
where a, = logo, and h l = log Pi. 

(a) Assuming the two sets of random effects are independently distributed 

as a,- N(n a ,T a ) and 6, N(m, n), use BUGS or WinBUGS to ana¬ 
lyze these data (see Subsection C.3.3 in Appendix C, or go directly 
to the program's webpage, http://www.mrc-bsu.cam.ac.uk/bugs/) 
Adopt a vague prior structure, such as independent G(0.0001, 0.0001) 
priors for the r,-, independent N(0, 0.0001) priors for fj. a and and 
G( 1, 0.04) priors for r a and rj (the latter as encouraged the original 
authors). 

Note that the full conditional distributions of the random effects are 
not simple conjugate forms nor guaranteed to be log-concave, so BUGS' 
Metropolis capability is required. Besides the usual posterior sum¬ 
maries and convergence checks, investigate the acceptance rate of the 
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l 


XI i 

Vi 

l 

X\i 

XU 

Vi 

1 

3.7 

0.825 

1 

21 

0.4 

2 

0 

2 

3.5 

1.09 

1 

22 

0.95 

1.36 

0 

3 

1.25 

2.5 

1 

23 

1.35 

1.35 

0 

4 

0.75 

1.5 

1 

24 

1.5 

1.36 

0 

5 

0.8 

3.2 

1 

25 

1.6 

1.78 

1 

6 

0.7 

3.5 

1 

26 

0.6 

1.5 

0 

7 

0.6 

0.75 

0 

27 

1.8 

1.5 

1 

8 

1.1 

1.7 

0 

28 

0.95 

1.9 

0 

9 

0.9 

0.75 

0 

29 

1.9 

0.95 

1 

10 

0.9 

0.45 

0 

30 

1.6 

0.4 

0 

11 

0.8 

0.57 

0 

31 

2.7 

0.75 

1 

12 

0.55 

2.75 

0 

32 

2.35 

0.03 

0 

13 

0.6 

3 

0 

33 

1.1 

1.83 

0 

14 

1.4 

2.33 

1 

34 

1.1 

2.2 

1 

15 

0.75 

3.75 

1 

35 

1.2 

2 

1 

16 

2.3 

1.64 

1 

36 

0.8 

3.33 

1 

17 

3.2 

1.6 

1 

37 

0.95 

1.9 

0 

18 

0.85 

1.415 

1 

38 

0.75 

1.9 

0 

19 

1.7 

1.06 

0 

39 

1.3 

1.625 

1 

20 

1.8 

1.8 

1 






Table 7.3 Finney's vasoconstriction data 


Metropolis algorithm, and the predictive distribution of Y 4 7 , the miss¬ 
ing observation on patient 4 at 28 hours after drug administration. 
Are any patients "outliers" in some sense? 

(b) The assumption that the random effects a, and b l are independent 
within individuals was probably unrealistic. Instead, follow the orig¬ 
inal analysis of Wakefield et al. (1994) and assume the 0; = (a,, bi)' 
are i.i.d. from a distribution, where p. — (p a ,Pb)- Again 

adopt a corresponding vague conjugate prior specification, namely 
H ~ N-z{0, C) and Q ~ Wishart((pf?) _ 1 ,p) with C k. 0, p = 2, and 
R = Diag(0.04, 0.04). 

Describe the changes (if any) in your answers from those in part (a). 
Summarize your experience with WinBUGS . Did it perform as indicated 
in the documentation? What changes would you suggest to improve 
its quality, generality, and ease of use in applied data analysis settings? 

8 . Suppose we have a collection of binary responses yi £ {0, 1}, i = 1,..., n, 
and associated ^-dimensional predictor variables x;. Define the latent 


© 2000 by CRC Press LLC 



variables y* as 


V* = xf/3 + 6, : , i = 1,.. ,,n , 

where the e,- are independent mean-zero errors having cumulative dis¬ 
tribution function F, and /3 is a k-dimensional regression parameter. 
Consider the model 

r o, if y * < o 
‘ _ l i, if > o • 

(a) Show that if F is the standard normal distribution, this model is 
equivalent to the usual probit model for p t = P(Y t = 1). 

(b) Under a N(fi, £) prior for f3, find the full conditional distributions 
for /3 and the Y* i =1, n. 

(c) Use the results of the previous part to find the posterior distribution 
for /3 under a flat prior for the data in Table 7.3. Originally given 
by Finney (1947), these data give the presence or absence of vaso¬ 
constriction in the skin of the fingers following inhalation of a certain 
volume of air, aq, at a certain average rate, Xo. (Include an intercept 
in your model, so k = 3.) 

(d) How would you modify your computational approach if, instead of the 
probit model, we wished to fit the logit (logistic regression) model? 

(e) Perform the actual MCMC fitting of the logistic and probit models 
using the BUGS or WinBUGS language. Do the fits differ significantly? 
(Hint: Follow the approach of Chapter 5, problem 13.) 


9 . Suppose that in the context of the previous problem, the data y i are 
not merely binary but ordinal, i.e., y t g {0,1 .... to}, i = 1,.. ., n. We 
extend our previous model to 



0 , 

1 , 


TO — 1, 


if Yf < 7i 

if 7 i < Y* < 72 


if 7 m _i < Y* < 7 m 
if Y? > 7 m 


(a) Suppose we wish to specify G as the cdf of Yi (e.g., G ~ Bin(m,pj)). 
Find an expression for the appropriate choice of "bin boundaries" 
7 j j = 1, ... to, and derive the new full conditional distributions for 
f3 and the Y* in this case. 

(b) Now suppose instead that we wish to leave G unspecified. Can we 
simply add all of the jj s into the sampling order and proceed? What 
are the appropriate full conditional distributions to use for the y.: ? 


10. The data in Table 7.4 were originally analyzed by Tolbert et al. (2000), 
and record the horizontal sq and vertical s 2 coordinates (in meters 
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station 

i 

latitude 

Sli 

longitude 

®2 i 

observed data 
*(*) 

i 

-18.9805 

359.3649 

0.085 

2 

-16.9316 

359.1380 

0.069 

3 

-15.1309 

357.1999 

0.084 

4 

-12.5863 

356.9081 

0.104 

5 

-12.0162 

356.5347 

0.111 

6 

-9.9322 

355.4129 

0.069 

7 

-9.7132 

364.1625 

0.049 

8 

-13.1500 

350.8925 

0.053 

9 

-11.2352 

358.3001 

0.082 

10 

-9.8524 

359.5717 

0.065 

A 

-13.1383 

356.5103 

- 

B 

-13.9711 

356.0425 

- 


Table 7.4 Ozone data at ten monitoring stations, Atlanta metro area, 6/3/95 


X10 4 ) of ten ozone monitoring stations in the greater Atlanta metro 
area. Figure 7.4 on page 274 shows the station locations on a greater 
Atlanta zip code map. The table also gives the ten corresponding ozone 
measurements A(s*) (one-hour max) for June 3, 1995. 

(a) Using the isotropic Bayesian kriging model of Section 7.8.1 applied 
to Z(Si) = log A(s,). find the posterior distribution of the spatial 
smoothing parameter <f) and the true ozone concentrations at target 
locations A and B, whose coordinates are also given in Table 7.4. 
(As can be seen from Figure 7.4, these are two points on opposite 
sides of the same Atlanta city zip code.) Set x 1 = 0, and use a flat 
prior on fi, a vague IG(3, 0.5) prior (mean 1, variance 1) on cr 2 , and a 
vague G(. 001,100) (mean .1, variance 10) prior on <fi. Are your results 
consistent with the mapped source data? 

(b) Why might an anisotropic model be more appropriate here? Fit this 
model, using a vague Wishart prior for the matrix B, and compare 
its fit to the isotropic model in part (a). 

11. Table 7.5 presents observed (Y,) and expected (E t ) cases of lip can¬ 
cer in 56 counties in Scotland, which have been previously analyzed by 
Clayton and Kaldor (1987) and Breslow and Clayton (1993). The table 
also shows a spatial covariate, x the percentage of each county's pop¬ 
ulation engaged in agriculture, fishing, or forestry (AFF), the observed 
standardized mortality ratio (SMR) for the county, 100Y;/ E{, and the 
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i 

Yt 

Ei 

Xi 

SMR 

Adjacent counties (di) 

i 

9 

1.4 

16 

652.2 

5, 9, 11, 19 

2 

39 

8.7 

16 

450.3 

7, 10 

3 

11 

3.0 

10 

361.8 

6, 12 

4 

9 

2.5 

24 

355.7 

18, 20, 28 

5 

15 

4.3 

10 

352.1 

1, 11, 12, 13, 19 

6 

8 

2.4 

24 

333.3 

3, 8 

7 

26 

8.1 

10 

320.6 

2, 10, 13, 16, 17 

8 

7 

2.3 

7 

304.3 

6 

9 

6 

2.0 

7 

303.0 

1, 11, 17, 19, 23, 29 

10 

20 

6.6 

16 

301.7 

2, 7, 16, 22 

11 

13 

4.4 

7 

295.5 

1, 5, 9, 12 

12 

5 

1.8 

16 

279.3 

3, 5, 11 

13 

3 

1.1 

10 

277.8 

5, 7, 17, 19 

14 

8 

3.3 

24 

241.7 

31, 32, 35 

15 

17 

7.8 

7 

216.8 

25, 29, 50 

16 

9 

4.6 

16 

197.8 

7, 10, 17, 21, 22, 29 

17 

2 

1.1 

10 

186.9 

7, 9, 13, 16, 19, 29 

18 

7 

4.2 

7 

167.5 

4, 20, 28, 33, 55, 56 

19 

9 

5.5 

7 

162.7 

1, 5, 9, 13, 17 

20 

7 

4.4 

10 

157.7 

4, 18, 55 

21 

16 

10.5 

7 

153.0 

16, 29, 50 

22 

31 

22.7 

16 

136.7 

10, 16 

23 

11 

8.8 

10 

125.4 

9, 29, 34, 36, 37, 39 

24 

7 

5.6 

7 

124.6 

27, 30, 31, 44, 47, 48, 55, 56 

25 

19 

15.5 

1 

122.8 

15, 26, 29 

26 

15 

12.5 

1 

120.1 

25, 29, 42, 43 

27 

7 

6.0 

7 

115.9 

24, 31, 32, 55 

28 

10 

9.0 

7 

111.6 

4, 18, 33, 45 


Table 7.5 Observed SMRs for Lip Cancer in 56 Scottish Counties 


the spatial relation of each county to the others via the indices of those 
counties adjacent to it. The raw data and the AFF covariate are mapped 
in Figure 7.5 on page 274. 

(a) Following Section 7.8.2, fit the model Tj ~ Po{ P t eP') where 

Hi = axi /10 + 0i 

and the 0 t have the exchangeable prior (7.26). Estimate the posterior 
distributions of Kg and a, and comment on your findings. 
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i Yi Ei Xi SMR Adjacent counties ( d t ) 

9, 15, 16, 17, 21, 23, 25, 26, 34, 


29 

16 

14.4 

10 

111.3 

30 

11 

10.2 

10 

107.8 

31 

5 

4.8 

7 

105.3 

32 

3 

2.9 

24 

104.2 

33 

7 

7.0 

10 

99.6 

34 

8 

8.5 

7 

93.8 

35 

11 

12.3 

7 

89.3 

36 

9 

10.1 

0 

89.1 

37 

11 

12.7 

10 

86.8 

38 

8 

9.4 

1 

85.6 

39 

6 

7.2 

16 

83.3 

40 

4 

5.3 

0 

75.9 

41 

10 

18.8 

1 

53.3 

42 

8 

15.8 

16 

50.7 

43 

2 

4.3 

16 

46.3 

44 

6 

14.6 

0 

41.0 

45 

19 

50.7 

1 

37.5 

46 

3 

8.2 

7 

36.6 

47 

2 

5.6 

1 

35.8 

48 

3 

9.3 

1 

32.1 

49 

28 

88.7 

0 

31.6 

50 

6 

19.6 

1 

30.6 

51 

1 

3.4 

1 

29.1 

52 

1 

3.6 

0 

27.6 

53 

1 

5.7 

1 

17.4 

54 

1 

7.0 

1 

14.2 

55 

0 

4.2 

16 

.0 

56 

0 

1.8 

10 

.0 


43, 50 

24, 38, 42, 44, 45, 56 
14, 24, 27, 32, 35, 46, 47 
14, 27, 31, 35 
18, 28, 45, 56 

23, 29, 39, 40, 42, 43, 51, 52, 54 

14, 31, 32, 37, 46 
23, 37, 39, 41 
23, 35, 36, 41, 46 

30, 42, 44, 49, 51, 54 

23, 34, 36, 40, 41 
34, 39, 41, 49, 52 

36, 37, 39, 40, 46, 49, 53 
26, 30, 34, 38, 43, 51 
26, 29, 34, 42 

24, 30, 38, 48, 49 
28, 30, 33, 56 

31, 35, 37, 41, 47, 53 
24, 31, 46, 48, 49, 53 
24, 44, 47, 49 

38, 40, 41, 44, 47, 48, 52, 53, 54 

15, 21, 29 
34, 38, 42, 54 
34, 40, 49, 54 
41, 46, 47, 49 
34, 38, 49, 51, 52 
18, 20, 24, 27, 56 
18, 24, 30, 33, 45, 55 


Table 7.5 (continued) 


(b) Repeat the previous analysis using the mean structure 

Pi = axi/10 + <f>i 

where the <j>i have the CAR adjacency prior (7.27). How do your 
results change? 

(c) Conduct an informal comparison of these two models using the D1C 
criterion (6.25); which model emerges as better? Also use equation 
(6.24) to compare the two effective model sizes pp. 
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(d) How might we determine whether significant spatial structure remains 
in the data after accounting for the covariate? 

(e) Suppose we wanted to fit a model with both the heterogeneity terms 
(Oi) and the clustering terms (<p,). What constraints or modifications 
to the prior distributions would be required to insure model identifi- 
ability (and hence, MCMC sampler convergence)? 
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Figure 7.4 Ten ozone monitoring stations (source locations) and two target loca¬ 
tions, greater Atlanta. 



Figure 7.5 Scotland Up cancer data: a) crude standardized mortality ratios ( 0 b- 
served/expected X 100); b) AFFcovariate values. 
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CHAPTER 8 


Case studies 


Most skills are best learned not through reading or quiet reflection, but 
through actual practice. This is certainly true of data analysis, an area with 
few hard-and-fast rules that is as much art as science. It is especially true 
of Hayes and empirical Bayes data analysis, with its substantial computing 
component and the challenging model options encouraged by the breadth 
of the methodology. The recent books edited by by Gatsonis et al. (1993, 
1995, 1997, 1999, 2000), Gilks et al. (1996), and Berry and Stangl (1996) 
offer a broad range of challenging and interesting case studies in Bayesian 
data analysis, most using modern MCMC computational methods. 

In this chapter we provide three fully worked case studies arising from 
real-world problems and datasets. While all have a distinctly biomedical 
flavor, they cover a broad range of methods and data types that transfer 
over to applications in agriculture, economics, social science, and industry. 
All feature data with complex dependence structures (often introduced by 
an ordering in time, space, or both), which complicate classical analysis and 
suggest a hierarchical Bayes or EB approach as a more viable alternative. 

Our first case study evaluates drug-induced response in a series of re¬ 
peated measurements on a group of AIDS patients using both longitudinal 
modeling and probit regression. We then compare the survival times of 
patients in each drug and response group. The second study involves se¬ 
quential analysis, in the context of monitoring a clinical trial. The results of 
using carefully elicited priors motivates a comparison with those obtained 
with vague priors, as well as the sort of robust analysis suggested in Sub¬ 
section 6.1.2. Finally, our third case study concerns the emerging field of 
spatial statistics, and illustrates the predictive approach to model selection 
described in Subsection 6.5. Here we obtain estimates and predictions for 
county-specific lung cancer rates by gender and race over a 21-year pe¬ 
riod in the U.S. state of Ohio. We also assess the impact of a potential air 
contamination source on the rates in the areas that surround it. 
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8.1 Analysis of longitudinal AIDS data 

8.1.1 Introduction and background 

Large-scale clinical trials often require extended follow-up periods to eval¬ 
uate the clinical efficacy of a new treatment, as measured for example by 
survival time. The resulting time lags in reporting the results of the trial 
to the clinical and patient communities are especially problematic in AIDS 
research, where the life-threatening nature of the disease heightens the need 
for rapid dissemination of knowledge. A common approach for dealing with 
this problem is to select an easily-measured biological marker, known to be 
predictive of the clinical outcome, as a surrogate endpoint. For example, 
the number of CD4 lymphocytes per cubic millimeter of drawn blood has 
been used extensively as a surrogate marker for progression to AIDS and 
death in drug efficacy studies of HIV-infected persons. Indeed, an increase 
in the average CD4 count of patients receiving one of the drugs in our 
study (didanosine) in a clinical trial was the principal basis for its limited 
licensing in the United States. 

Several investigators, however, have questioned the appropriateness of 
using CD4 count in this capacity. For example, Lin et al. (1993) showed 
that CD4 count was not an adequate surrogate for first opportunistic infec¬ 
tion in data from the Burroughs Wellcome 02 trial, nor was it adequate for 
the development of AIDS or death in AIDS Clinical Trials Group (ACTG) 
Protocol 016 data. Choi et al. (1993) found CD4 count to be an incom¬ 
plete surrogate marker for AIDS using results from ACTG Protocol 019 
comparing progression to AIDS for two dose groups of zidovudine (AZT) 
and placebo in asymptomatic patients. Most recently, the results of the 
joint Anglo-French "Concorde" trial (Concorde Coordinating Committee, 
1994) showed that, while asymptomatic patients who began taking AZT 
immediately upon randomization did have significantly higher CD4 counts 
than those who deferred AZT therapy until the onset of AIDS-related com¬ 
plex (ARC) or AIDS itself, the survival patterns in the two groups were 
virtually identical. 

We analyze data from a trial involving 467 persons with advanced HIV 
infection who were randomly assigned to treatment with the antiretrovi¬ 
ral drugs didanosine (ddl) or zalcitabine (ddC). This study, funded by the 
National Institutes of Health (NIH) and conducted by the Community Pro¬ 
grams for Clinical Research on AIDS (CPCRA), was a direct randomized 
comparison of these two drugs in patients intolerant to, or failures on, AZT. 
The details of the conduct of the ddl/ddC study are described elsewhere 
(Abrams et al., 1994; Goldman et al., 1996); only the main relevant points 
are given here. The trial enrolled HIV-infected patients with AIDS or two 
CD4 counts of 300 or less, and who fulfilled specific criteria for AZT in¬ 
tolerance or failure. CD4 counts were recorded at study entry and again 
at the 2, 6, 12, and 18 month visits (though some of these observations 
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are missing for many individuals). The study measured several outcome 
variables; we consider only total survival time. 

Our main goal is to analyze the association among CD4 count, survival 
time, drug group, and AIDS diagnosis at study entry (an indicator of dis¬ 
ease progression status). Our population of late-stage patients is ideal for 
this purpose, since it contains substantial information on both the actual 
(survival) and surrogate (CD4 count) endpoints. 

8.1.2 Modeling of longitudinal CD4 counts 

Let Yij denote the j th CD4 measurement on the i th individual in the study, 
j = 1 Si and i = I.... , n. Arranging each individual's collection of 
observations in a vector Y, = (Y,i, ..., Y ts ,) T we attempt to fit the mixed 
effects model (7.10), namely, 

Yi — X, q + W ,■ /?,■ + ti , 

where as in Section 7.4, X, is an s, x p design matrix, aisa/jx 1 vector 
of fixed effects, W; is an .$,■ x q design matrix (q typically less than p), and 
(3i is a q x 1 vector of subject-specific random effects, usually assumed to 
be normally distributed with mean vector 0 and covariance matrix V. We 
assume that given $, the components of e, are independent and normally 
distributed with mean 0 and variance a 2 (marginalizing over /%, the Y, com¬ 
ponents are again correlated, as desired). The d,;s capture subject-specific 
effects, as well as introduce additional variability into the model. 

Since we wish to detect a possible increase in CD4 count two months 
after baseline, we fit a model that is linear but with possibly different 
slopes before and after this time. Thus the subject-specific design matrix 
W i for patient i in equation (7.10) has j th row 

w u — (1 > tij > (tij ~ 2) + ) , 

where t(j £ {0, 2,6,12,18} and z + = max( 2 ,0). Hence the three columns 
ofW< correspond to individual-level intercept, slope, and change in slope 
following the changepoint, respectively. We account for the effect of covari¬ 
ates by including them in the fixed effect design matrix X,-. Specifically, we 
set 

X,- = (W; | diWi | a, W s ) , (8.1) 

where di is a binary variable indicating whether patient i received ddl 
=1) or ddCTdj = 0), and a,- is another binary variable telling whether 
the patient was diagnosed as having AIDS at baseline a; = 1) or not 
(a t - = 0). Notice from equation (8.1) that we ha we. p = 3<? = 9; the two 
covariates are being allowed to affect any or all of the intercept, slope, 
and change in slope of the overall population model. The corresponding 
elements of the a vector then quantify the effect of the covariate on the form 
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Figure 8.1 Exploratory plots of CD4 count, cldl/ddC data. 


of the CD4 curve. In particular, our interest focuses on the a parameters 
corresponding to drug status, and whether they differ from 0. 

Adopting the usual exchangeable normal model for the random effects, 
we obtain a likelihood of the form 

n n 

[] N Si (Yi |X,u + Wi/%, <r 2 I 5t ) N 3 (/3i |0, V) , (8.2) 

i -1 *=1 

where Nk(-\p, 27) denotes the A:-dimensional normal distribution with mean 
vector p and covariance matrix 27. To complete our Bayesian model speci¬ 
fication, we adopt independent prior distributions A^ 9 (a|c, D), IG(a 2 \a, b), 
and IWfVKpR) -1 ,/?), where IG and IW denote the inverse gamma and 
inverse Wishart distributions, respectively (see Appendix A for details). 

Turning to the observed data in our study, boxplots of the individual CD4 
counts for the two drug groups, shown in Figures 8.1(a) and (b), indicate 
a high degree of skewness toward high CD4 values. This, combined with 
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the count nature of the data, suggests a square root transformation for 
each group. As Figures 8.1(c) and (d) show, this transformation improves 
matters considerably. The sample medians, shown as white horizontal bars 
on the boxplots, offer reasonable support for our assumption of a linear 
decline in square root CD4 after two months. The sample sizes at the five 
time points, namely (230. 182, 153, 102, 22) and (236, 186, 157, 123, 14) 
for the ddl and ddC groups, respectively, indicate an increasing degree 
of missing data. The total number of observations is 1405, an average of 
approximately 3 per study participant. 

Prior deterrn ination 

Values for the parameters of our priors (the hyperparameters) may some¬ 
times be determined from past datasets (Lange, Carlin, and Gelfand, 1992), 
or elicited from clinical experts (Chaloner et al., 1993; see also Section 8.2). 
In our case, since little is reliably known concerning the CD4 trajectories 
of late-stage patients receiving ddl or ddC, we prefer to choose hyperpa¬ 
rameter values that lead to fairly vague, minimally informative priors. Care 
must be taken, however, to ensure that this does not lead to an improper 
posterior distribution. For example, taking p extremely small and D ex¬ 
tremely large would lead to confounding between the fixed and random 
effects. (To see this, consider the simple but analogous case of the oneway 

layout, where Y.\ N(p . + a*, a 2 ) i = 1and we place noninfor- 

mative priors on both p and the a,. To avoid this while still allowing the 
random effects a reasonable amount of freedom, we adopt the rule of thumb 
wherein p — n/20 and 

R = Diag((ri/8) 2 , (fo/S) 2 , (r 3 /8) 2 ) , 

where r, is the total range of plausible parameter values across the individ¬ 
uals. Since R is roughly the prior mean of V, this gives a ±2 prior standard 
deviation range for P, that is half of that plausible in the data. Since in our 
case Yij corresponds to square root CD4 count in late-stage HIV patients, 
we take n = 16 and r z = r ;i = 2. and hence our rule produces p = 24 and 
R = Diag(2 2 , (.25) 2 , (.25) 2 ). 

Turning to the prior on a 2 we take a = 3 and b = .005, so that <r 2 has 
both mean and standard deviation equal to 10 2 , a reasonably vague (but 
still proper) specification. Finally, for the prior on a, we set 

c = (10,0,0, 0,0,0,-3,0,0), and 
D = Diag{ 2 2 , l 2 , l 2 , (. 1 ) 2 ,1 2 , l 2 , l 2 , l 2 , l 2 ) , 

a more informative prior but strongly biased away from 0 only for the 
baseline intercept, otj, and the intercept adjustment for a positive AIDS 
diagnosis, a 7 . These values correspond to our expectation of mean baseline 
square root CD4 counts of 10 and 7 (100 and 49 on the original scale) for 
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the AIDS-negative and AIDS-positive groups, respectively. This prior forces 
the drug group intercept (i.e., the effect at baseline) to be very small, since 
patients were assigned to drug group at random. Indeed, this randomization 
implies that 04 = 0. Also, the prior allows the data to determine the degree 
to which CM trajectories depart from horizontal, both before and after 
the changepoint. Similarly, though a and the /?, are uncorrelated a priori, 
the priors are sufficiently vague to allow the data to determine the proper 
amount of correlation a posteriori. 

Since the prior is fully conjugate with the likelihood, the Gibbs sam¬ 
pler is an easily-implemented MCMC method. The required full condi¬ 
tional distributions for this model are similar to those in Example 5.6, and 
strongly depend on conjugacy and the theoretical results in Lindley and 
Smith (1972). They are also presented in Section 2.4 of Lange, Carlin, and 
Gelfand (1992), who used them in a changepoint analysis of longitudinal 
CD4 cell counts for a group of HIV-positive men in San Francisco. Alterna¬ 
tively, our model could be described graphically, and subsequently analyzed 
in BUGS (Gilks, Thomas, and Spiegelhalter, 1992; Spiegelhalter, Thomas, 
Best, and Gilks, 1995a,b). This graphical modeling and Bayesian sampling 
computer package frees the user from these computational details. See Ap¬ 
pendix Subsection C.3.3 for a brief description of graphical modeling and 
instructions on obtaining the BUGS package. 

Results 

We fit our model by running 5 parallel Gibbs sampler chains for 500 itera¬ 
tions each. These calculations were performed using Fortran 77 with IMSL 
subroutines on a single Sparc 2 workstation, and took approximately 30 
minutes to execute. A crudely overdispersed starting distribution for V -1 
was obtained by drawing 5 samples from a Wishart distribution with the 
R matrix given above but having p = dim(/l,) = 3, the smallest value 
for which this distribution is proper. The g th chain for each of the other 
parameters was initialized at its prior mean plus (g - 3) prior standard 
deviations, g = 1, . . ., 5. While individually monitoring each of the 1417 
parameters is not feasible. Figure 8.2 shows the resulting Gibbs chains for 
the first 6 components of a, the 3 components of /? 8 (a typical random 
effect), the model standard deviation < 7 , and the (1,1) and (1,2) elements of 
V- 1 , respectively. Also included are the point estimate and 95 th percentile 
of Gelman and Rubin's (1992) scale reduction factor (labeled "G&R" in the 
figure), which measures between-chain differences and should be close to 1 
if the sampler is close to the target distribution, and the autocorrelation 
at lag 1 as estimated by the first of the 5 chains (labeled "lag 1 acf" in the 
figure). The figure suggests convergence for all parameters with the possi¬ 
ble exception of the two V -1 components. While we have little interest in 
posterior inference on V -1 , caution is required since a lack of convergence 
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GSR: 1,1.01; lag 1 acf = 0.375 


GSR: 1.02,1.05; lag lad = 0.279 


GSR: 1.02,1.05; lag 1 acf=0.325 



GSR: 1.01,1.03; lag 1 acf = 0.15 


GSR: 1.15,1.38; lag 1 acf=0.901 


GSR: 1.07,1.19; lag 1 acf=0.906 



Figure 8.2 Convergence monitoring plots, five independent chains. Horizontal 
axis is iteration number; vertical axes are o-i, a 2 , a 3 , ft 4 , as a 6 08i, 08 2 08 3 
v, (V-%, and (V _1 ) 12 . 


on its part could lead to false inferences concerning the other parameters 
that do appear to have converged (Cowles, 1994). In this case, the culprit 
appears to be the large positive auto correlations present in the chains, a 
frequent cause of degradation of the Gelman and Rubin statistic in oth¬ 
erwise well-behaved samplers. Since the plotted chains appear reasonably 
stable, we stop the sampler at this point, concluding that an acceptable 
degree of convergence has been obtained. 
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mode 

95% interval 

Baseline 

intercept 

a 1 

9.938 

9.319 

10.733 


slope 

Ot'2 

-0.041 

-0.285 

0.204 


change in slope 

a 3 

-0.166 

-0.450 

0.118 

Drug 

intercept 

a 4 

0.004 

-0.190 

0.198 


slope 

0/5 

0.309 

0.074 

0.580 


change in slope 

a 6 

-0.348 

-0.671 

-0.074 

AIDS Dx 

intercept 

O' 7 

-4.295 

-5.087 

-3.609 


slope 

0 's 

-0.322 

-0.588 

-0.056 


change in slope 

«9 

0.351 

0.056 

0.711 


Table 8 .1 Point and interval estimates, fixed effect parameters. 


Based on the visual point of overlap of the 5 chains in Figure 8.2, we dis¬ 
carded the first 100 iterations from each chain to complete sampler "burn- 
in." For posterior summarization, we then used all of the remaining 2000 
sampled values despite their autocorrelations, in the spirit of Geyer (1992). 
For the model variance <r 2 , these samples produce an estimated posterior 
mean and standard deviation of 3.36 and 0.18, respectively. By contrast, 
the estimated posterior mean of Vn is 13.1, suggesting that the random 
effects account for much of the variability in the data. The dispersion ev¬ 
ident in exploratory plots of the 3 % posterior means (not shown) confirms 
the necessity of their inclusion in the model. 

Posterior modes and 95% equal-tail credible intervals (based on posterior 
mixture density estimates) for all 9 components of the fixed effect param¬ 
eter vector a are given in Table 8.1. The posterior credible interval for a 4 
is virtually unchanged from that specified in its prior, confirming that the 
randomization of patients to drug group has resulted in no additional infor¬ 
mation on <24 in the data. The 95% credible sets of (.074, .580) and (-.671, 
-.074) for a 5 (pre-changepoint drug slope) and a 4 (post- changepoint drug 
slope), respectively, suggest that the CD4 trajectories of persons receiving 
ddl have slopes that are significantly different from those of the ddC pa¬ 
tients, both before and after the two-month changepoint. AIDS diagnosis 
also emerges as a significant predictor of the two slopes and the intercept, 
as shown in the last three lines in the table. 

Plots of the fitted population CD4 trajectories for each of the four possi¬ 
ble drug-diagnosis combinations (not shown) suggest that an improvement 
in square root CD4 count typically occurs only for AIDS-negative patients 
receiving ddl. However, there is also some indication that the ddC trajec¬ 
tories "catch up" to the corresponding ddl trajectories by the end of the 
observation period. Moreover, while the trajectories of ddl patients may be 
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Model 1: Vinv - W(rho =24, R = Diag( 4 0.0625 0.0625)) 


Model 2: Vinv - Wfrtio = 24, R = Diag( 4 0.0625)) 




• 2-1012 


- 2-1012 


quantiles of standard normal, Model 1 


quantiles of standard normal, Model 2 


Figure 8.3 Bayesian model choice based on 70 random observations from the 1405 
total. Model 1: changepoint model; model 2: linear decay model. 


somewhat higher, the difference is quite small, and probably insignificant 
clinically - especially compared to the improvement provided by a negative 
AIDS diagnosis at baseline. 

The relatively small differences between the fitted pre- and post-change- 
point slopes suggest that a simple linear decline model might well be suf¬ 
ficient for these data. To check this possibility, we ran 5 sampling chains 
of 500 iterations each for this simplified model. We compared these two 
models using the cross-validation ideas described near the beginning of 
Section 6.5, using a 5% subsample of 70 observations, each from a different 
randomly selected individual. Figure 8.3 offers a side-by-side comparison 
of histograms and normal empirical quantile-quantile plots of the resulting 
residuals for Model 1 (changepoint) and Model 2 (linear). The q-q plots 
indicate a reasonable degree of normality in the residuals for both models. 
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a) y_r - E(y_r) versus sd(y r); plotting character indicates model number 



Figure 8.4 Bayesian model choice based on 70 random observations from the 1405 
total. Model 1: changepoint model; model 2: linear decay model. 


The absolute values of these residuals for the two models sum to 66.37 and 
70.82, respectively, suggesting almost no degradation in fit using the sim¬ 
pler model, as anticipated. This similarity in quality of fit is confirmed by 
Figure 8.4(a), which plots the residuals versus their standard deviations. 
While there is some indication of smaller standard deviations for residuals 
in the full model (see e.g. the two clumps of points on the right-hand side 
of the plot), the reduction appears negligible. 

A pointwise comparison of CPO is given in Figure 8.4(b), along with a 
reference line marking where the values are equal for the two models. Since 
larger CPO values are indicative of better model fit, the predominance of 
points below the reference line implies a preference for Model 1 (the full 
model). Still, the log(CPO) values sum to -130.434 and -132.807 for the 
two models, respectively, in agreement with our previous finding that the 
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improvement offered by Model 1 is very slight. In summary, while the more 
complicated changepoint model was needed to address the specific clinical 
questions raised by the trial's designers concerning a CD4 boost at two 
months, the observed CD4 trajectories can be adequately explained using 
a simple linear model with our two covariates. 

Finally, we provide the results of a more formal Bayesian model choice 
investigation using the marginal density estimation approach of Subsec¬ 
tion 6.3.2 to compute a Bayes factor between the two models. We begin 
with a trick recommended by Chib and Carlin (1999): for Model 1, we write 
P({/3i}>«k 2 > V -1 ,y) = p({/3,}|a, <T 2 , V -1 , y)p(a|<7 2 , V -1 , y), and recog¬ 
nize that the reduced conditional p(a\cr 2 , V -1 , y) is available in closed 
form. That is, we marginalize the likelihood over the random effects, ob¬ 
taining 

Y,-|a, a 2 , V, M = 1 ~ N (X,a, <t 2 I s> + W,VW J) , 

from which we derive the reduced conditional for a. This trick somewhat 
reduces autocorrelation in the chains, and also allows us to work with the 
appropriate three-block extension of equation (6.14), 

log/(Y|M = 1) = log/(Y|a', (<r 2 )', (V" 1 )', M = 1) 

+ logp(a', (<r 2 )', (V-iyiM = 1) - logp(«'| Y, (a 2 )', (V- 1 )', M — 1) 

- logp((<r 2 y|Y, (V- 1 )', M= 1) - logpftV- 1 )'|Y, M = 1) 

for Model 1, and similarly for Model 2. 

Using 5 chains of 1000 post-burn-in iterations each, we estimated the 
log marginal likelihood as -3581.795 for Model 1 and -3577.578 for Model 
2. These values are within .03 of values obtained independently for this 
problem by Chib (personal communication), and produce an estimated 
Bayes factor of exp[-3577.578- (-3581.795)] = 67.83, moderate to strong 
evidence in favor of Model 2 (the simpler model without a change point). 
This is of course consistent with our findings using the less formal model 
choice tools desribed above. 

Summary and discussion 

In modeling longitudinal data one must think carefully about the number 
of random effects to include per individual, and restrict their variability 
enough to produce reasonably precise estimates of the fixed effects. In our 
dataset, while there was some evidence that heterogeneous variance pa¬ 
rameters a 2 might be needed, including them would increase the number 
of subject-specific parameters per individual to four - one more than the 
average number of CD4 observations per person in our study. Fortunately, 
accidental model overparametrization can often be detected by ultrapoor 
performance of multichain convergence plots and diagnostics of the va¬ 
riety shown in Figure 8.2. In cases where reducing the total dimension 
of the model is not possible, recent work by Gelfand, Satin, and Carlin 
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(1995, 1996) indicates that the relative prior weight assigned to variabil¬ 
ity at the model's first stage (cr 2 ) versus its second stage (V) can help 
suggest an appropriate transformation to improve algorithm convergence. 
More specifically, had our prior put more credence in small cr 2 values and 
larger random effects, a sampler operating on the transformed parameters 
a* = a, /3* — Pi + a would have been recommended. 

Missing data are inevitable, and consideration of the possible biasing 
effects of missingness is very important. Validity of inferences from our 
models depends on the data being observed at random (i.e., missingness 
cannot depend on what one would have observed) and an appropriate model 
structure. Little and Rubin (1987) refer to this combination as "ignorable 
missingness." In our application, missing CD4 counts are likely the result 
of patient refusal or a clinical decision that the patient was too ill to have 
CD4 measured. Thus our data set is very likely biased toward the rela¬ 
tively higher CD4 counts in the later follow-up periods; we are essentially 
modeling CD4 progression among persons who continue to have CD4 mea¬ 
sured. As a check on ignorability, one can explicitly model the missingness 
mechanism using Tobit-type or logistic models that link missingness to the 
unobserved data, and compare the resulting inferences. See Cowles et al. 
(1996) for an application of the Tobit approach in estimating compliance 
in a clinical trial. 

8.1.3 CD4 response to treatment at two months 

The longitudinal changepoint model of the previous section takes advantage 
of all the observations for each individual, but provides only an implicit 
model for drug-related CD4 response. For an explicit model, we reclassify 
each patient as "response" or "no response," depending on whether or not 
the CD4 count increased from its baseline value. That is, for i = 1, . . ., m = 
367, the number of patients who had CD4 measured at both the baseline 
and two month visits, we define R, = 1 if Y l2 — Yu > 0, and Ri — 0 
otherwise. Defining p i as the probability that patient i responded, we fit 
the probit regression model 

Pi = $(7o +7idi +J 2 di) , (8.3) 

where $ denotes the cumulative distribution function of a standard normal 
random variable, the js are unknown coefficients, and the covariates d t 
and <n describe treatment group and baseline AIDS diagnosis as in Subsec¬ 
tion 8.1.2. We used probit regression instead of logistic regression since it 
is more easily fit using the Gibbs sampler (Carlin and Poison, 1992; Albert 
and Chib, 1993a). 

Under a flat prior on ( 70 , 7 i, 72 )', the posterior distributions for each 
of the 7 ' coefficients were used to derive point and interval estimates of 
the corresponding covariate effects. In addition, given particular values of 
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the covariates, the posterior distributions were used with equation (8.3) 
to estimate the probability of a CD4 response in subgroups of patients by 
treatment and baseline prognosis. The point and 95% interval estimates 
for the regression coefficients obtained using the Gibbs sampler are shown 
in Table 8.2. Of the three intervals, only the one for prior AIDS diagnosis 
excludes 0, though the one for treatment group nearly does. The signs 
on the point estimates of these coefficients indicate that patients without 
a prior AIDS diagnosis and, to a lesser extent, those in the ddl group 
were more likely to experience a CD4 response, in agreement with our 
Subsection 8.1.2 results. 


point 95% confidence limits 



estimate 

lower 

upper 

7 o (intercept) 

.120 

-.135 

.378 

7 i (treatment) 

.226 

-.040 

.485 

72 (AIDS diagnosis) 

-.339 

-.610 

-.068 


Table 8.2 Point and interval estimates, CD4 response model coefficients 

Converting these results for the regression coefficients to the probability- 
of-response scale using equation (8.3), we compare results for a typical 
patient in each drug-diagnosis group. Table 8.3 shows that the posterior 
median probability of a response is larger by almost .09 for persons tak¬ 
ing ddl, regardless of health status. If we compare across treatment groups 
instead of within them, we see that the patient without a baseline AIDS 
diagnosis has roughly a .135 larger chance of responding than the sick pa¬ 
tient, regardless of drug status. Notice that for even the best response group 
considered (baseline AIDS-free patients taking ddl), there is a substantial 
estimated probability of not experiencing a CD4 response, confirming the 
rather weak evidence in favor of its existence under the full random effects 
model (7.10). 


ddl ddC 

AIDS diagnosis at baseline .502 .413 

No AIDS diagnosis at baseline .637 .550 


Table 8.3 Point estimates, probability of response given treatment and prognosis 


8.1.4 Survival analysis 

In order to evaluate the clinical consequences of two-month changes in the 
CD4 lymphocyte count associated with the study drugs, we performed a 
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parametric survival analysis. As in Subsection 8.1.3, the m = 367 patients 
with baseline and two-month CD4 counts were divided into responders 
and non-responders. We wish to fit a proportional hazards model, wherein 
the hazard function h for a person with covariate values z and survival or 
censoring time t takes the form h(t\z, /3) = ho(t) exp(z'/3). In our case, we 
define four covariates as follows: zq = 1 for all patients; z\ = 1 for ddl 
patients with a CD4 response, and 0 otherwise; z 2 = 1 for ddC patients 
without a CD4 response, and 0 otherwise; and z 3 = 1 for ddC patients 
with a CD4 response, and 0 otherwise. Writing (3 — (/3q, ih , fl'i)' and 
following Cox and Oakes (1984, Sec. 6.1), we obtain the loglikelihood 


m 

log L(P) — log h(t{ jzj, /3) -|- JTlogSfel*,*), (8.4) 

i £14 i = 1 

where S denotes the survival function, U the collection of uncensored failure 
times, and z, = ( zg,. z\i, z 2 i, z$iY • Our parametrization uses nonresponding 
ddl patients as a reference group; p 1 ,p 2 , and /) 3 capture the effect of being 
in one of the other 3 drug-response groups. 

We followed Dellaportas and Smith (1993) by beginning with a Weibull 
model for the baseline hazard, ho(t) = pt p ~ l , but replaced their rejection 
sampling algorithm based on the concavity of the loglikelihood (8.4) with 
the easier-to-program Metropolis subchain approach (see Subsection 5.4.4). 
Our initial concern that the Weibull model might not be rich enough evap¬ 
orated when we discovered extremely high posterior correlation between p 
and /?o • This suggests the baseline hazard may reasonably be thought of 
as exponential in our very ill population, and hence we fixed p = 1 in all 
subsequent calculations. 

The resulting estimated marginal posterior distributions for the /?s were 
fairly symmetric, and those for f3\ and d 2 were centered near 0. However, 
the 95% equal-tail posterior credible set for /? 3j (-1.10, 0.02), suggests pre¬ 
dominantly negative values. To ease the interpretation of this finding, we 
transform the posterior (3 samples into corresponding ones from the sur¬ 
vival function using the relation S(t\z,(3) — exp {—t exp(z / ^)}. We do this 
for each drug-response group over a grid of 9 equally-spaced t values from 
0 to 800 days. Figure 8.5(a) gives a smoothed plot of the medians of the 
resulting samples, thus providing estimated posterior survival functions for 
the four groups. As expected, the group of ddC responders stands out, 
with substantially improved mortality. A more dramatic impression of this 
difference is conveyed by the estimated posterior distributions of median 
survival time, 9(z) = (log 2) exp(—z'/3), shown in Figure 8.5(b). While a 
CD4 response translates into improved survival in both drug groups, the 
improvement is clinically significant only for ddC recipients. 
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Figure 8.5 Posterior summaries, ddl/ddC data: (a) estimated survival functions; 
(b) median survival time distributions. 


8.1.5 Discussion 

Analysis of our CD4 data using both a longitudinal changepoint model 
and a simpler probit regression model suggests that ddC is less successful 
than ddl in producing an initial CD4 boost in patients with advanced 
HIV infection. However, this superior short-term performance does not 
translate into improved survival for ddl patients. In fact, it is the patients 
"responding" to ddC who have improved survival. It is quite possible that 
these ddC "responders'" are simply those persons whose CD4 would have 
increased in the absence of treatment, whereas the ddl "responders" reflect 
the short-term increase in CD4 produced by ddl. These results provide 
another example of the caution needed in evaluating treatments on the basis 
of potential surrogate markers (like CD4 count), and on the use of post¬ 
randomization information to explain results. The CD4 count has strong 
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prognostic value, but treatment-induced modifications are not similarly 
prognostic. 


8.2 Robust analysis of clinical trials 

8.2.1 Clinical background 

In this section we consider another case study related to AIDS, but in the 
context of the interim monitoring and final analysis of an ongoing clini¬ 
cal trial, namely, the CPCRA toxoplasmic encephalitis (TE) prophylaxis 
trial. When the degree of immune damage becomes sufficiently severe, an 
HIV-infected person may develop a specific subset of more than 20 in¬ 
fections, several cancers, a variety of neurological abnormalities including 
severe declines in mental function, and wasting. Among the most ominous 
infections is encephalitis due to Toxoplasma gondii. This infection is the 
cause of death in approximately 50% of persons who develop it and the 
median survival is approximately six months. Additional clinical and im¬ 
munological background concerning TE is provided in the review paper by 
Carlin, Chaloner, Louis, and Rhame (1995). 

All patients entered into the study had either an AIDS defining illness 
or a CD4 count of less than 200. In addition all had a positive titre for 
toxoplasma gondii and were therefore at risk for TE. The trial was origi¬ 
nally designed with four treatment groups: active clindamycin, placebo clin¬ 
damycin, active pyrimethamine, and placebo pyrimethamine with an allo¬ 
cation ratio of 2:1:2:1, respectively. It was anticipated that the two placebo 
groups would be pooled in a final analysis for a three treatment group 
comparison of placebo, active pyrimethamine, and active clindamycin. 

The first patient was randomized in September 1990. The active and 
placebo clindamycin treatments were terminated soon after in March 1991, 
due to non-life-threatening toxicides that resulted in the discontinuation 
of medication for many patients in the active clindamycin group. People 
on either active or placebo clindamycin were offered rerandomization to 
either active or placebo pyrimethamine with a 2:1 allocation ratio. The 
trial then became a simple two treatment study of active pyrimethamine 
against placebo pyrimethamine. 


8.2.2 Interim monitoring 

In order to ensure scientific integrity and ethical conduct, decisions con¬ 
cerning whether or not to continue a clinical trial based on the accumu¬ 
lated data are made by an independent group of statisticians, clinicians and 
ethicists who form the trial's data safety and monitoring board, or DSMB. 
These boards have been a standard part of clinical trials practice (and NIH 
policy) since the early 1970s; see Fleming (1992) for a full discussion. As 
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a result, the trial's statisticians require efficient algorithms for computing 
posterior summaries of quantities of interest to the DSMB. 

For simplicity, we consider the case of p = 2 covariates: zn indicates 
whether individual i received pyrimethamine ~ 1( =1) or placebo (z\, = 
0), and Z 2 i denotes the baseline CD4 cell count of individual i at study 
entry. Using the method and software of Chaloner et al. (1993), a prior 
for pi was elicted from five AIDS experts, coded as A, B, C, D, and E. 
Experts A, B, and C are physicians: A practices at a university AIDS 
clinic, B specializes in the neurological manifestations of AIDS and HIV 
infection, and C practices at a community clinic. Experts D and E are non¬ 
physicians involved in AIDS clinical trials: D manages studies and E is an 
infectious disease epidemiologist. Each expert's prior was then discretized 
onto a 31-point grid, each support point having probability 1/31. Finally, 
each prior was transformed to the fa scale and converted to a smooth curve 
suitable for graphical display using a histogram approach. In what follows 
we consider only the prior of expert A; the paper by Carlin et al. (1993) 
provides a more complete analysis. While the trial protocol specified that 
only the onset of TE was to be treated as an endpoint, expert A was unable 
to separate the outcomes of death and TE. As such, this prior relates to 
the dual endpoint of TE or death. 

As described in the report by Jacobson et al. (1994), the trial's data 
safety and monitoring board met on three occasions after the start of the 
trial in September of 1990 to assess its progress and determine whether it 
should continue or not. These three meetings analyzed the data available 
as of the file closing dates 1/15/91,7/31/91, and 12/31/91, respectively. 
At its final meeting, the board recommended stopping the trial based on 
an informal stochastic curtailment rule: the pyrimethamine group had not 
shown significantly fewer TE events up to that time, and, due to the low 
TE rate, a significant difference was judged unlikely to emerge in the fu¬ 
ture. An increase in the number of deaths in the pyrimethamine group 
was also noted, but this was not a stated reason for the discontinuation of 
the trial (although subsequent follow-up confirmed this mortality increase). 
The recommendation to terminate the study was conditional on the agree¬ 
ment of the protocol chairperson after unblinding and review of the data. 
As a result, the trial did not actually stop until 3/30/92, when patients 
were instructed to discontinue their study medication. 

We now create displays that form a Bayesian counterpart to standard 
monitoring boundaries. Using the Cox partial likelihood (7.15) as our like¬ 
lihood, Figure 8.6 diplays the elicited prior, likelihood, exact posterior, and 
normal approximation to the posterior of /3i for the four data monitoring 
dates listed above. The number of accumulated events is shown beneath 
each display. The exact posterior calculations were obtained by simple sum¬ 
mation methods after discretizing fa onto a suitably fine grid. Notice that 
in the first frame the likelihood is flat and the prior equals the exact pos- 
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n = 38 events 


n = 60 events 


Figure 8.6 Prior A, likelihood, and posteriors for fh on four data monitoring 
dates, TE trial data. Covariate = baseline CD4 count; monitoring dates are 
1/15/91, 7/31/91, 12/31/91, and 3/30/92. 


terior, since no information has yet accumulated. From this frame we see 
that expert A expects the pyrimethamine treatment to be effective at pre¬ 
venting TE/death; only one of the 31 prior support points are to the right 
of 0. The remaining frames show the excess of events (mostly deaths) accu¬ 
mulating in the treatment group, as the likelihood and posterior summaries 
move dramatically toward positive /?i values. The extremely limited prior 
support for these values (and in fact, no support in the region where the 
likelihood is maximized) causes the likelihood to appear one-tailed, and the 
exact posterior to assume a very odd shape. The normal approximation is 
inadequate when the sample size is small (due to the skew in the prior) and, 
contrary to the usual theory, also when it is large (due to the truncation 
effect of the prior on the likelihood). 

Next, we turn to the monitoring plot associated with the posteriors in 
Figure 8.6. The trial protocol specifies a target reduction of 50% in hazard 
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cumulative # of events (calendar date) 


Figure 8.7 Posterior monitoring plot for fj lt TE trial data. E = exact posterior; 
N = normal approximation; L = likelihood. 


for the treatment relative to control. For a more sensitive analysis we use 
a 25% reduction as the target, implying the indifference zone boundaries 
/?i,C/ =0 and fli p = log(.75) = -.288. Figure 8.7 plots these two tail 
areas versus accumulated information for the exact posterior, approximate 
normal posterior, and likelihood (or more accurately, the posterior under a 
prior on 3\ that is flat over the support range specified by our expert). This 
figure nicely captures the steady progression of the data's impact on expert 
A's prior. Using a stopping level of p = .10 (indicated by the horizonta 
reference line in the figure), the flat prior analysis recommends stopping the 
trial as of the penultimate monitoring point; by the final monitoring point 
the data converts the informative prior to this view as well. The inadequacy 
of the normal approximation is seen in its consistent overstatement of the 
upper tail probability. 
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Impact of subjective prior information 

Figure 8.7 raises several questions and challenges for further research. For 
instance, notice that expert A's prior suggests stopping immediately (i.e. 
not starting the trial) and deciding in favor of the treatment. Though none 
of the elicitees would have actually recommended such an action (recall 
the 50% reduction goal), it does highlight the issue of ethical priors for 
clinical trials, as alluded to earlier. In addition, the fact that expert A's 
clinically-derived prior bears almost no resemblance to the data, besides 
being medically interesting, is perhaps statistically troubling as well, since 
it suggests delaying the termination of an apparently harmful treatment. 
Perhaps expert A should have received more probabilistic training before 
elicitation began, or been presented with independent data on the issue. 

Prior beliefs common to all five of the experts interviewed included TE 
being common in this patient population, and a substantial beneficial pro¬ 
phylactic effect from pyrimethamine. However, data from the trial were 
at odds with these prior beliefs. For instance, out of the 396 people in 
the trial, only 12 out of 264 on pyrimethamine and only 4 out of 132 on 
placebo actually got TE. The low rate is believed to be partly due to a 
change in the standard care of these patients during the trial. During the 
trial, information was released on the prophylactic efffects of the antibi¬ 
otic trimethoprim-sulphamethoxazole (TMP/SMX, or Bactrim). TE was 
extremely rare for patients receiving this drug, and many patients in our 
study were taking it concurrently with their study medication. 

Finally, all five experts put high probability on a large beneficial effect of 
the treatment, but the rate of TE was actually higher in the pyrimethamine 
group than in the placebo group. In addition, a total of 46 out of 264 people 
died in the pyrimethamine group, compared to only 13 out of 132 in the 
placebo group. Apparently the assumption made at the design stage of the 
trial that there would be no difference in the non-TE death rates in the 
different trial arms was incorrect. 

Still, the decision makers are protected by the simultaneous display of the 
flat prior (likelihood) results with those arising from the overly optimistic 
clinical prior; the results of a skeptical prior might also be included. Overall, 
this example clearly illustrates the importance of prior robustness testing 
in Bayesian clinical trial monitoring; this is the subject of Subsection 8.2.3. 

Impact of proportional hazards 

Our data indicate that the proportional hazards assumption may not hold 
in this trial. The relative hazard for death in the pyrimethamine group 
increased over follow-up compared to the placebo group (see Figure 1 
of Jacobson et al., 1994). Specifically, while the hazard for the placebo 
group is essentially constant for the entire study period, the hazard in 
the pyrimethamine group changes level at approximately six months into 
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the follow-up period. As such, we might reanalyze our data using a model 
that does not presume proportional hazards. Several authors have proposed 
models of this type (see e.g. Louis, 1981b; Lin and Wei, 1989; Tsiatis, 1990). 
Other authors (e.g. Grambsch and Therneau, 1994) create diagnostic plots 
designed specifically to detect the presence of time-varying covariates in an 
otherwise proportional hazards setting. 

The simplest approach might be to include a parameter in the Cox model 
(7.15) that explicitly accounts for any hazard rate change. In our case, this 
might mean including a new covariate z p +\ which equals 0 if the individual 
is in the placebo group (z i = 0), equals -1 if the individual is in the 
drug group (z\ =1) and the corresponding event (failure or censoring) 
occurs prior to time t = 6 months, and equals +1 if the individual is 
in the drug group and the corresponding event occurs after time t = 6. A 
posterior distribution for 0 p+l having significant mass far from 0 would then 
confirm the assumed pattern of nonproportional hazards. Alternatively, we 
might replace /3 lZl in model (7.15) by p iZl w(t\ /? p+1 ), where w is a weight 
function chosen to model the relative hazard in the drug group over time. 
For example, w(t\ (3 p +i) = exp(/? p+1 i) could be used to model an increasing 
relative hazard over time, with (3\ — 0 corresponding to the "null" case of 
proportional hazards. 

Simultaneous display of monitoring information based on the flat prior 
(likelihood) and those based on a variety of priors (the optimistic, the 
skeptical, and a few personal) protects the monitoring board from over¬ 
dependence on a single prior or a collection of mis-calibrated personal pri¬ 
ors and indicates the robustness or lack thereof of a monitoring decision. 
Bayesian monitoring must be robust to be acceptable. 


8.2.3 Prior robustness and prior scoping 

Phase 111 clinical trials (like ours) are large scale trials designed to identify 
patients who are best treated with a drug whose safety and effectiveness has 
already been reasonably well established. Despite the power and generality 
of the Bayesian approach, it is unlikely to generate much enthusiasm in 
monitoring such trials unless it can be demonstrated that the methods 
possess good frequentist properties. Figure 8.8 displays the posterior for 
0 under a flat prior (i.e., the standardized marginal likelihood) at each 
of the four dates mentioned above. Recalling that negative values of 0 
correspond to an efficacious treatment, the shift toward positive 0 values 
evident in the third and fourth monitoring points reflects an excess of 
deaths in the treatment group. This emerging superiority of the placebo 
was quite surprising, both to those who designed the study and to those 
who provided the prior distributions. 

Consider the percentile restriction approach of Subsection 6.1.2. We first 
integrate /) 2 out of the Cox model likelihood L((3) by a simple summation 
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Figure 8.8 Posterior for the treatment effect under a flat prior, TE trial data. 
Endpoint is TE or death; covariate is baseline CD4 count. 


method after discretizing fa onto a suitably fine grid. This produces the 
marginal likelihood for fa, which we write in the notation of Section 6.1.2 
simply as f(x\0). At the fourth monitoring point, stopping probability 
p = 0.10 and weight 7r = 0.25 on the null hypothesis produce the val¬ 
ues f(x\9\) = 1.28 and c = 3/(x|10) = 0.48, where 0 = .600 is the marginal 
MLE of 0. 

Suppose we seek the a for which f( c is nonempty given a specific in¬ 
difference zone ($i,0u). Since negative values of 8 are indicative of an 
efficacious treatment, in practice one often takes dp = 0 and 9l < 0, 
the additional benefit being required of the treatment in order to justify 
its higher cost in terms of resources, clinical effort, or toxicity. We adopt 
this strategy and, following our work in the previous subsection, choose 
8l = log(.75) = -.288, so that a reduction in hazard for the treatment 
relative to control of at least 25% is required to conclude treatment supe¬ 
riority. 

At monitoring point four we have 8jj < 9 = .600, so from equation (6.6) 
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(8.5) 


the a that satisfy the condition sup G f f(x\9)dG(9) > c are such that 

a > [/(-r|^) ~ + [f- 

Under the Cox model we obtain f(x\6i) _ .02 and /(x|%) _ .18, so again 
taking p — .10 and 7r = .25, equation (8.5) simplifies to ciu > .145 cii + .273. 
For the [a,i , au ) pairs satisfying this equation, there exists at least one 
conditional (over Hi prior G that has these tail areas and permits stopping 
and rejecting Ho, given the data collected so far. Conversely, for conditional 
priors G featuring an (a^, a u ) pair located in the lower region, stopping 
and rejecting H o is not yet possible. Points lying on the boundary between 
these two regions correspond to the maximum amount of mass the prior 
may allocate to the indifference zone before stopping becomes impossible. 

Next, consider the case of stopping and rejecting H\ (i.e., accepting Ho) 
at monitoring point four. This occurs if P(0 ()|x) < p, that is, when 

f f(x\9)dG(9) < c* = /(®|0). Hence we are now interested 

in the infimum of f f(x\9)dG(9). Since min{f(x\9i), f(x\9u)} = f(x\9 l) = 

. 020 for our dataset, we have from (6.5) that the a satisfying the condition 
info / f(x\6)dG{6) < c* are such that 

a u > (1 — 50c*) - = .700 —at, (8.6) 

where we have again let p— .10 and 7 t = .25. Here, increasing it (equiva¬ 
lently c) serves to decrease the intercept of the boundary, and vice versa. 
In this regard, note that n = 0 is required to make stopping impossible for 
all a. On the other hand, if 7r is at least .529, then there is at least one 
prior for which it is possible to stop and accept the null regardless of a. 

Figure 8.9 shows the four prior tail area regions determined by the two 
stopping boundaries discussed above. Each region is labeled with an or¬ 
dered pair, the components of which indicate whether stopping to reject is 
possible for the prior tail areas in this region under the null and alternative 
hypotheses, respectively. The relatively large size of the lower left region 
indicates the mild nature of the evidence provided by these data; there are 
many prior tail area combinations for which it is not yet possible to stop 
and reject either Ho or Hi- 

Since it may be difficult for the user to think solely in terms of the 
conditional prior G, we have also labeled the axes of Figure 8.9 with the 
corresponding unconditional tail areas, which we denote by pi and pu. 
That is, p L ~ P unc (treatment superior) = ai(l — it), and p v = P„ nc (control 
superior) _ a G ( 1 — 7 r); recall that; tt = .25 in our example. Increasing 7 r 
(equivalently c) increases only the an -intercept in boundary (8.5), so that 
checking the impact of this assumption is straightforward. In particular, 
for this dataset we remark that 7r < .014 makes stopping and rejecting Hq 
possible for all a (boundary (8.5) has a^-intercept 1), whiles > .410 makes 
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Figure 8.9 Conditional prior tail area regions, with ordered pair indicating 
whether it is possible to stop and reject Ho and.Hi, respectively (Y =yes, N 
= no). 

such stopping impossible regardless of a (boundary (8.5) has ajy-intercept 

1 ). 

We now turn to the semiparametric approach of Subsection 6.1.2. We use 
a N(9\0k , approximation to the likelihood at monitoring point k, k — 
2, 3, 4, as suggested by Figure 8.8. Since our analysis is targeted toward the 
members of the Data Safety and Monitoring Board, we restrict attention to 
"skeptical" conditional priors g by setting p = 0. We may then compute the 
Bayes factor BF in favor of H o that arises from equations (6.8) and (6.9) 
as a function of the conditional prior standard deviation r for 0 < r < 4. 
To facilitate comparison with previous figures. Figure 8.10(a) plots B not 
versus r, but, instead, versus the conditional prior upper tail probability 
au a one-to-one function of r. Assigning a mass it — .25 to the null region 
(the indifference zone) and retaining our rejection threshold of p = .1, we 
would reject Ho for B < i; this value is marked on the figure with a dashed 
horizontal reference line. The message from this figure is clear: no uniform- 
normal mixture priors of the form (6.7) allow rejection of Hq at monitoring 
points 2 or 3, but by monitoring point 4, rejection of Ho is favored by all 
but the most extreme priors (i.e., those that are either extremely vague or 
essentially point masses at 0). 

Next, suppose we wish to investigate robustness as 7T varies. We reject 
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a) Bayes factor versus conditional upper tail area 



b) Conditional upper tail area versus prior mass on indifference zone; 
for combinations to the left of each curve, 
all priors result in stopping to reject Ho 

Figure 8.10 Semiparametric prior partitions, TE trial data. 

H 0 if and only if B{t) < or equivalently, 

„< [f l-rW ^r 1 

p 

Again using the one-to-one relationship between ajj and r, Figure 8.10(b) 
plots this boundary in n, ap -space for the final three monitoring points. 
Conditional on the data, all priors that correspond to points lying to the left 
of a given curve lead to rejection of H (u while all those to the right do not. 
(Note the difference in interpretation between this graph and Figure 8.9, 
where each plotted point corresponded to infinitely many priors.) We see 
the mounting evidence against H 0 in the boundaries' gradual shift to the 
right, increasing the number of priors that result in rejection. 

Finally, we turn to the fully parametric approach outlined in Subsec¬ 
tion 6.1.2. Fixing the prior mean p at 0 again, the prior support for the 
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b) Bounds on posterior probability of h l versus prior contamination factor 

Figure 8.11 Fully parametric prior partitions, TE trial data. 


indifference zone, it, is now determined by r. This loss of one degree of 
freedom in the prior means that a bivariate plot like Figure 8. 10(b) is no 
longer possible, but a univariate plot as in Figure 8.10(a) is still sensible. 
To obtain the strongest possible results for our dataset, in Figure 8.1 1(a) 
we plot the posterior probability of Hl : 0 < fli . rather than H o, versus 
r. We see that rejection of this hypothesis is possible for all skeptical nor¬ 
mal priors by the third monitoring point; by the fourth monitoring point, 
the evidence against this hypothesis is overwhelming. While this analysis 
considers a far smaller class of priors than considered in previous figures, 
it still provides compelling evidence by the third monitoring point that the 
treatment is not superior in terms of preventing TE or death. 

It is of interest to compare our results with those obtained using the 
e-contamination method of Greenhouse and Wasserman (1995). As men¬ 
tioned in Subsection 6.1.2, this method enables computation of upper and 
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lower bounds on posterior expectations for priors of the form (1 — e)go + eg, 
where g 0 is some baseline distribution and g can range across the set of 
all possible priors. Reasonably tight bounds for moderately large e suggest 
robustness of the original result to changes in the prior. Figure 8.1 1(b) 
plots the Greenhouse and Wasserman bounds for the posterior probability 
of Hl for the final three monitoring points, where we have taken qq as our 
N(6\y, t 2 ) prior with y = 0 and r = 1. At the third monitoring point, the 
posterior probability is bounded below the rejection threshold of . 1 only 
for e roughly less than .17, suggesting that we cannot stray too far from 
the baseline normal prior and maintain total confidence in our conclusion. 
By the fourth monitoring point, however, the upper bound stays below the 
threshold for e values as large as .65, indicating a substantial degree of 
robustness to prior specification in our stopping decision. 


8.2.4 Sequential decision analysis 


In this subsection, we apply the fully Bayesian decision theoretic approach 
of Subsections 7.7.2 and 7.7.3 to our dataset. The posteriors pictured in 
Figure 8.8 are nothing but standardized versions of the marginal partial 
likelihoods for the treatment effect 0 given all the data observed up to the 
given date, and so provide the necessary inputs for these approaches. 

Consider first the backward induction approach of Subsection 7.7.2 in 
this setting. The very near-normal appearance of the curves at the latter 
three time points justifies an assumption of normality for the prior and 
likelihood terms. In fact, our dataset provides a good test for the K — 2 
calculations in Subsection 7.7.2, since information that would suggest stop¬ 
ping accumulates only at the final two monitoring points. Thus we take the 
appropriate normal approximation to the 7/31/91 likelihood, a N(y, <j 2 k _■>) 
distribution with y =.021 and <7^_ 2 = .664 2 as our prior p(0) and as¬ 
sume independent N(9,cr 2 K _ l ) and N(6, a 2 K ) distributions for f(y^_] \6) 
and f(yK |$) respectively. Following our earlier discussion we assume the 
likelihood variance parameters to be known, using the data-based values 
<?k_i = .488 2 and <j 2 k = .515 2 . Coupled with the normal-normal conjugacy 
of our prior-likelihood combination, this means that the marginal distri¬ 
butions m(yK- 1 ) and rn(y^\yK~i) emerge as normal distributions as well. 
Specifically, the samples required to estimate L' 3 K ~ 2 ^ may be obtained as 

y*K-l,i ~ , CTk _2 + a K~ i)> * = 1, • ■ •, -8, 

and 


y*K,ijW. 


iid 


K-l,i 


N 


Ck-iV + Vk- 


2 VK-yi 


'K 


-2 + a K- 


u K-l u K-2 


+ <T 2 k 


where j = 1, . . . , B for each i. 
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dl (stop and decide in favor of the treatment) 
d2 (stop and decide in favor of the placebo) 
d3 (continue sampling) 



Figure 8.12 Plot of posterior expected loss l\ K ~ 2%> versus p for the three decisions 
(i =1,2, 3) computed via backward induction, using the normal/normal model 
with variances estimated from TE trial data. Choosing the treatment is optimal 
for g < -.65, while choosing the placebo is optimal for g > .36; otherwise, it is 
optimal to continue the trial to time K - 1. 


We use the same indifference zone as above; in the notation of Subsec¬ 
tion 7.7.2 these are Op = 0 and Oi = log(.75) = -.288. We adopt a very 
simple loss and cost structure, namely, * =1 and Ak = .1, both 

for all k. Performing the backward induction calculations using B = 5000 
Monte Carlo draws, we obtain l[ K ~ 2 ^ =.021, L^ A_2 * =-.309, and 
L^~ 2 ^ = -.489, implying that for this cost, loss, and distributional struc¬ 
ture, it is optimal to continue sampling. However, a one-step-back analysis 
at time K - 1 using p(9\yx-i) as the prior finds that l{ K ~^ is now the 
smallest of the three values by far, meaning that the trial should now be 
stopped and the treatment discarded. This confirms the board's actual 
recommendation, albeit for a slightly different reason (excess deaths in the 
drug group, rather than a lack of excess TE cases in the placebo group). 

Figure 8.12 expands our time K-2 analysis by replacing the observed 
7/31/91 mean p = .021 with a grid of points centered around this value. 
We can obtain the intersection points of L^~ 2 \p) with L\ K ~ 2 \p) and 

L^~ 2 \n) via interpolation as Jk-2,L = -.64 and 1K-2U = -36, respec¬ 
tively. Notice that these values determine a continuation region that con- 
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tains the observed 7/31/91 mean p _ .021, and is centered around the 
midpoint of the indifference zone, $u + 0i)/2 = 144. 

Since our loss functions are monotonic and we have normal likelihoods, 
the forward sampling method of Subsection 7.7.3 is applicable and we may 
compare its results with those above. Using the same loss and cost struc¬ 
ture, setting fj. = .021, and using G = 10 6 Monte Carlo samples, the algo¬ 
rithm quickly deduces that 7 k-2,l < p < 1k-2,u, so that continuation of 
the trial is indeed warranted. The algorithm also finds 1 = -0.260 

and Jk -1 u = -0.044, implying that the trial should be continued beyond 
the penultimate decision point (12/31/91) only if the posterior mean at 
this point lies within this interval. Note from Figure 8.8 that the observed 
mean, E(9\ypc-i) = .491, lies above this interval, again in agreement with 
the board's decision to stop at this point and reject the treatment. 

As with the backward induction method, obtaining exact values via for¬ 
ward sampling for the initial cutoffs j K - 2 ,l and jk -2,(7 is more involved, 
since it is again necessary to resort to a grid search of p values near the 
solution. Still, using G — 10 6 forward Monte Carlo draws we were able 
to confirm the values found above (-.64 and .36) via backward induction. 
While the continued need for grid search is a nuisance, we remark that for 
forward sampling, the same G Monte Carlo draws may be used for each 
candidate p value. 

Using forward sampling we may extend our analysis in three significant 
ways. First, we extend to the case k = 3, so that the 1/15/91 likelihood 
now plays the role of the prior, with the remaining three monitoring points 
representing accumulating data. Second, we return to the "positive part" 
loss structure (7.16), simply by adding this modification to our definition of 
the Monte Carlo losses . as in (7.19). Note that this is still a monotone 
loss specification (hence suitable for forward sampling in the restricted 
class that depends only on the posterior mean), but one which leads to 
substantial complications in the backward induction solution. Finally, the 
positive part loss allows us to order the losses so that incorrect decisions 
are penalized more heavily the later they are made. Specifically, we take 
s\ k) - s (k) =k+3, for k=0,1,2,3. 

Since forward sampling cannot be implemented with the improper prior 
suggested by the 1/15/91 likelihood in Figure 8 . 8 , we instead retain a nor¬ 
mal specification, i.e., a N (p. cr^) with p = 0 and (Xq = 5. This prior is quite 
vague relative to our rather precise likelihood terms. We also retain our uni¬ 
form cost structure, A\. = ■ 1 for all k, and simply set <j\ — = <x| = .5602, 

the sample mean of the three corresponding values estimated from our data, 
again giving our procedure something of an empirical Bayes flavor. Again 
using G = 10 6 Monte Carlo samples, the forward algorithm quickly de¬ 
termines 7 q l << fi << j o u (so that the trial should indeed be carried 
out), and goes on to find precise values for the remainder of the -7 vector. 
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In particular, we obtain ji i = -0.530 and ji t u = 0.224, an interval that 
easily contains the observed 7/31/91 mean of .021, confirming the board's 
decision that the trial should not be stopped this early. 

Finally, we investigate the robustness of our conclusions by replacing our 
normal likelihood with a much heavier-tailed Student's t with y = 3 degrees 
of freedom. Since there is now no longer a single sufficient statistic for the 
optimal rule need not be solely a function of the posterior mean, but we may 
still use forward sampling to find the best member in this class of posterior 
mean-based rules. We retain our previous prior, cost, and loss structure, 
and set <r\ = ,560 2 /3, k = 1, 2, 3, so that Var(Y k \9 ) = va\/{v — 2) = .5602, 
as before. Again using G = 10 6 Monte Carlo samples, our forward algorithm 
now requires univariate numerical integration to find the posterior means 
to compare to the elements of the -y vector, but is otherwise unchanged. We 
now obtain j 1 L = -0.503 and u = 0.205, a slightly narrower interval 
than before (probably since the (3 has a more concentrated central 95% 
region than the normal for a fixed variance), but one which still easily 
contains the observed 7/31/91 mean of .021. 


8.2.5 Discussion 

As seen in the progression of displays in Subsection 8.2.3, the practical use¬ 
fulness of a prior partitioning analysis often depends on the breadth of the 
class of priors considered. One must select the proper class carefully: large 
classes may lead to overly broad posterior bounds, while narrow classes 
may eliminate many plausible prior candidates. The quasiunimodal, semi- 
parametric, and fully parametric classes considered here constitute only 
a small fraction of those discussed in the Bayesian robustness literature. 
Other possibilities include density bounded classes, density ratio classes, 
and total variation classes; see Wasserman and Kadane (1992) for a discus¬ 
sion and associated computational strategies. 

For computational convenience, we have frequently assumed normal dis¬ 
tributions in the above discussion. Using Monte Carlo integration methods, 
however, prior partitioning can be implemented for any combination of like¬ 
lihood and prior. Similarly, one could apply the method in settings where 
6 is a multivariate parameter, such as a multi-arm trial or a simultaneous 
study of effectiveness and toxicity. As such, prior partitioning offers an at¬ 
tractive complement to traditional prior elicitation and robustness methods 
in a wide array of clinical trial settings. 

Subsection 8.2.4 illustrates the traditional backward induction approach 
to our two-sided univariate sequential decision problem. It also illustrates 
that the forward sampling approach dramatically reduces the computa¬ 
tional and analytical overhead, thus offering the possibility of fully Bayesian 
sequential decisionmaking with numerous interim looks, a previously un¬ 
tenable problem. Armed with this more powerful computational frame- 
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work, many challenges in the practical application of fully Bayesian meth¬ 
ods to clincial trial monitoring may now be tackled. Better guidelines for 
the proper choice of loss functions l^ k \di, 6) and corresponding sampling 
costs {.4 £}, expressed in either dollars earned or lives saved, are needed. 
For example, if an incorrect decision favoring the treatment were judged 
C times more egregious than an equivalent one favoring the placebo (i.e., 
one where the true value lay equally far from the indifference zone but 
in the opposite direction), we might choose = C ■ s k> . Extensions to 
more realistic trial settings (beyond the simple normal likelihood and prior 
illustrated above) would also be welcome. 

Finally, though we stress the importance of Bayesian robustness in the 
monitoring of Phase 111 clinical trials, more traditional Bayesian analyses 
are proving effective in the design and analysis of Phase I studies (dose¬ 
ranging, toxicity testing) and Phase II studies (indications of potential 
clinical benefit). In these settings, the ethical consequences of incorrect 
decisions are less acute than in Phase III studies, and decisions primarily 
involve the design of the next phase in drug development (including whether 
there should be another phase!). See Berry (1991, 1993) for discussions of 
these issues, and Berry et al. (2000) for a case study. 


8.3 Spatio-temporal mapping of lung cancer rates 

8.3.1 Introduction 

Environmental justice and environmental equity are emergent concepts in 
the development of environmental health policy. These concepts relate to 
whether exposures to and adverse outcomes from environmental contam¬ 
inants are shared equitably among various socio-demographic subgroups. 
Subpopulations of specific interest include low-income groups, racial and 
ethnic minorities, and sensitive individuals such as children and expectant 
mothers. 

Sound statistical methodology for assessing environmental justice (or de¬ 
tecting environmental "injustices") is needed. The assessment of environ¬ 
mental justice is a complex problem (Wagener and Williams, 1993) with 
three primary statistical components: 

(1) exposure assessment at any location within a geographic region, 

(2) accurate estimation of socio-demographic variables across the same ge¬ 
ographic region, and 

(3) disease (or other outcome of interest) incidence in small regions within 
the same region. 

This case study primarily concerns modeling associations between compo¬ 
nents (2) and (3). 

Several geostatistical smoothing, or kriging, methods have been proposed 
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for the analysis of the spatial pattern of disease. In these methods, disease 
risk is modeled as a continuous surface over the study area. Classical ap¬ 
proaches model dependence among observations as part of the single-stage 
likelihood, while Bayesian approaches typically assume observations to be 
conditionally independent given the model parameters, and subsequently 
incorporate dependence at the second stage as part of the prior distribution. 

Other approaches estimate rates and counts for administrative districts, 
usually producing a rate map. A disease map results by color-coding (or 
greyscale-coding) the estimated rates. Collections of maps for various dis¬ 
eases and/or geographic regions are often published together in a single 
atlas; see Walter and Birnie (1991) for a comparison of forty-nine disease 
atlases published during the last twenty years. 

Clayton and Kaldor (1987) and Manton et al. (1989) outline empirical 
Bayes approaches that account for spatial similarities among neighboring 
and nearby rates. Ghosh (1992) and Devine and Louis (1994) consider con¬ 
strained empirical Bayes approaches which guard against overshrinkage of 
the raw rates toward their grand mean. Cressie and Chan (1989) present 
an exploratory spatial analysis of regional counts of sudden infant death 
syndrome (SIDS). The regional aspect of the data drives their analysis, 
which is based on the autoregressive methods for data from a spatial lat¬ 
tice proposed by Besag (1974). Clayton and Bernardinelli (1992) review 
Bayesian methods for modeling regional disease rates. Besag, York, and 
Mollie(1991) begin with the Clayton and Kaldor (1987) approach and ex¬ 
tend it to a model for separating spatial effects from overall heterogeneity 
in the rates. Bernardinelli and Montomoli (1992) compare empirical and 
hierarchical Bayes methods, the latter implemented via MCMC methods. 
Breslow and Clayton (1993) place the disease mapping problem within the 
larger framework of generalized linear mixed models and provide approxi¬ 
mation schemes for inference. 

Here we extend the spatial models described by Besag et al. (1991) to 
accommodate general temporal effects, as well as space-time interactions. 
A hierarchical framework for modeling regional disease rates over space 
and time was given in Subsection 7.8.2. Our model also extends that of 
Bernardinelli, Clayton et al. (1995), who proposed a particular multiplica¬ 
tive form for space-time interaction. Subsection 8.3.2 presents our dataset 
of annual county-specific lung cancer rates by sex and race in the state of 
Ohio for the period 1968-1988, and also outlines the particular form of our 
model. Computational issues related to these models are presented in Sub¬ 
section 8.3.3, with special attention paid to parameter identifiability and 
the associated convergence of our Markov chain Monte Carlo algorithms. 
The results of fitting our basic model are given in Subsection 8.3.4, along 
with information on model validation and selection among more parsimo¬ 
nious models. The predictive model choice methods described in Section 6.5 
are seen to provide a feasible approach to model comparison in this highly 
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parametrized setting. Finally, we summarize our findings and discuss future 
implementation and interpretation issues in Subsection 8.3.5. 

8.3.2 Data and model description 

In this case study, we fit the model described in Subsection 7.8.2 to data 
originally analyzed by Devine (1992, Chapter 4). Here, yij k t is the number 
of lung cancer deaths in county i during year t for gender j and race k in 
the state of Ohio, and tiijkt is the corresponding exposed population count. 
These data were originally taken from a public use data tape (Centers for 
Disease Control, 1988) containing age-specific death counts by underlying 
cause and population estimates for every county in the United States. Our 
subset of lung cancer data are recorded for J =2 genders (male and female) 
and K = 2 races (white and nonwhite) for each of the I = 88 Ohio counties 
over an observation period of T = 21 years, namely 1968-1988 inclusive, 
yielding a total of 7392 observations. We obtain internally standardized ex¬ 
pected death counts as E ijkt = n ijkt y, where y = y ijkt / ^ ijkt n ijkt ), 

the average statewide death rate over the entire observation period. 

The study area includes the United States Department of Energy Fer- 
nald Materials Processing Center (FMPC). The Fernald facility recycles 
depleted uranium fuel from U.S. Department of Energy and Department 
of Defense nuclear facilities. The Fernald plant is located in the southwest 
corner of the state of Ohio approximately 25 miles northwest of the city of 
Cincinnati. The recycling process creates a large amount of uranium dust. 
Peak production occurred between 1951 and the early 1960s and some 
radioactive dust may have been released into the air due to inadequate fil¬ 
tration and ventilation systems. Lung cancer is of interest since inhalation 
is the primary route of the putative exposure and lung cancer is the most 
prevalent form of cancer potentially associated with exposure to uranium. 
We use reported rates for the years 1968-1988 to allow an appropriate (ten- 
to twenty-year) temporal lag between exposure and disease development. 

We apply spatio-temporal models to extend the spatial-only modeling of 
Devine (1992). The temporal component is of interest to explore changes 
in rates over a relatively long period of time. Demographic issues (related 
to environmental justice) are of interest because of possible variation in 
residential exposures for various population subgroups. In addition, the de¬ 
mographic profile of plant workers and nearby residents most likely evolved 
over the time period of interest. 

Devine (1992) and Devine, Louis, and Halloran (1994) applied Gaussian 
spatial models employing a distance matrix to the average lung cancer rates 
for white males over the 21-year period. Xia, Carlin, and Waller (1997) 
fit Poisson likelihood models incorporating the sex and race covariates to 
each year separately, thus providing a rough estimate of the change in 
heterogeneity and clustering present in the data over time, and motivating 
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our spatio-temporal analysis. We begin by fitting a version of model (7.28), 
namely, 


fiijkt = Sja + r k l3 + Sjr k t, + 1 + } , (8.7) 

where we adopt the gender and race scores 

_ f 0 if male _ f 0 if white 

Sj ~ \ 1 if female an Tk — \ 1 if nonwhite 

Letting 6^ = (0^\....., 0^)', ^ = (< j> ( p,..., <t ( p)', and denoting the 
/-dimensional identity matrix by I, we adopt the prior structure 

0 (1 >|r t ~ d N (o , and </> (< ) | A, CAR(\t)'% t = 1,..., T , (8.8) 

so that heterogeneity and clustering may vary over time. Note that the 
socio-demographic covariates (gender and race) do not interact with time 
or location. 

To complete the model specification, we require prior distributions for 
a, /?, £, the T; and the A<. Since a,/?, and £ will be identified by the like¬ 
lihood, we may employ a flat prior on these three parameters. Next, for 
the priors on the Tt and A, we employed conjugate, conditionally i.i.d. 
Gamma(a, b) and Gamma(c, cl) priors, respectively. As the discussion in 
Subsection 7.8.2 revealed, some precision is required to facilitate implemen¬ 
tation of an MCMC algorithm in this setting. On the other hand, too much 
precision risks likelihood-prior disagreement. To help settle this matter, we 
fit a spatial-only (reduced) version of model (8.7) to the data from the 
middle year in our set (1978, t = 11), using vague priors for A and r having 
both mean and standard deviation equal to 100 (a = c = 1, b = d = 100). 
The resulting posterior .025, .50, and .975 quantiles for A and r were (4.0, 
7.4, 13.9) and (46.8, 107.4, 313.8), respectively. As such, in fitting our full 
spatio-temporal model (8.7), we retain a = 1, b = 100 for the prior on r, 
but reset c = 1, d = 1 (i.e., prior mean and standard deviation equal to 
7). While these priors are still quite vague, the fact that we have used a 
small portion of our data to help determine them does give our approach 
a slight empirical Bayes flavor. Still, our specification is consistent with 
the advice of Bernardinelli, Clayton, and Montomoli (1995), who suggest 
that the heterogeneity parameters have prior standard deviation roughly 
7/10 that assigned to the clustering parameters. Recasting this advice in 
terms of prior precisions and the adjacency structure or our CAR prior for 
the we have A ~ r/(2m), where fh is the average number of counties 
adjacent to a randomly selected county (about 5-6 for Ohio). 
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8.3.3 Computational considerations 

Due to high dimensionality and complexity of our spatio-temporal model, 
some form of MCMC algorithm will be needed to obtain estimates of the 
posterior and predictive quantities of interest. The Gibbs sampler is not 
well-suited to this problem, however, because, with the exception of the At 
and r t , the full conditional distributions for the model parameters are not 
standard families. Therefore, we use instead the Metropolis algorithm to ob¬ 
tain the necessary samples. We begin with a univariate version,, each param¬ 
eter having a univariate normal candidate density with mean at the current 
value of this parameter and variance cr 2 chosen to provide a Metropolis ac¬ 
ceptance ratio between 25 and 50% (as per the recommendation of Gelman 
et al„ 1996). 

This collection of univariate Metropolis and Gibbs updating steps con¬ 
verges fairly slowly - no surprise in view of the between-parameter posterior 
correlations we would expect in our model. As we have seen, updating pa¬ 
rameters in multivariate blocks (Liu, Wong, and Kong, 1994) is one way to 
account for these correlations; another is through elementary transforma¬ 
tions (Gelfand et al., 1995, 1996). The latter approach is particularly useful 
here, and is motivated by the fact that the likelihood by itself cannot indi¬ 
vidually identify the spatial heterogeneity and clustering parameters, but 
only their sum. For simplicity, we illustrate in the case of a single period of 
observation ignoring subgroups. We make a linear transformation of vari¬ 
ables from to (0,*/) where Vi = e i + b • ■ • , Writing the 

posterior on the old scale as p(9, </>|y) oc L(0 + 4>V)p(^)p(4>) the poste¬ 
rior on the new scale is thus p(8, tj|y) oc L(ry, y)p(8)p(rj - 8). Along with 
our conditional independence assumptions, this implies the full conditional 
distributions 


p{.Vi\Vi*i,9,y) « L(Vi',yi)p{Vi - 0*1 {Vi - (8.9) 

and 

p(0i\0j?i,V,y) oc p(0i)p(vi - 8i\{Vj - 0j}j*i) (8.10) 

for i=l,...,7. Since the likelihood informs about r]i directly in (8.9), 
overall sampler convergence is improved. We also gain the serendipitous 
side benefit of a closed form (normal) full conditional in (8.10), since the 
algebraically awkward Poisson likelihood component is no longer present. 


8.3.4 Model fitting, validation, and comparison 

Using the cycle of univariate Metropolis and Gibbs steps described in Sub¬ 
section 8.3.3, we ran 5 parallel, initially overdispersed MCMC chains for 
500 iterations, a task which took roughly 20 minutes on a Sparc 10 work¬ 
station. Graphical monitoring of the chains for a representative subset of 
the parameters, along with sample autocorrelations and Gelman and Ru- 
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bin (1992) diagnostics, indicated an acceptable degree of convergence by 
around the ] 00 tfl iteration. Using the final 400 iterations from all 5 chains, 
we obtained the 95% posterior credible sets (-1.10, -1.06), (0.00, 0.05), and 
(-0.27, -0.17) for a, 3, and respectively. The corresponding point esti¬ 
mates are translated into the fitted relative risks for the four subgroups in 
Table 8.4. It is interesting that the fitted sex-race interaction £ reverses the 
slight advantage white men hold over nonwhite men, making nonwhite fe¬ 
males the healthiest subgroup, with a relative risk nearly four times smaller 
than either of the male groups. Many Ohio counties have very small non¬ 
white populations, so this result could be an artifact of our inability to 
model covariate-region interactions. 


demographic 

subgroup 

contribution 
to Sjk 

fitted log- 
relative risk 

fitted 

relative risk 

white males 

0 

0 

1 

white females 

a 

-1.08 

0.34 

nonwhite males 

P 

0.02 

1.02 

nonwhite females 

a + /?-K 

-1.28 

0.28 


Table 8.4 Fitted relative risks for the four socio-demographic subgroups in the 
Ohio lung cancer data. 

Turning to the spatio-temporal parameters, histograms of the sampled 
values (not shown) showed 0* * * distributions centered near 0 in most cases, 
but distributions typically removed from 0. This suggests some degree 
of clustering in the data, but no significant additional heterogeneity beyond 
that explained by the CAR prior. Figures 8.13 and 8.14 investigate this issue 
a bit further by checking for differential heterogeneity and clustering effects 
over time. For example. Figure 8.13 plots the estimated posterior medians 
for the clustering parameters A t versus t. The solid line in the plot is the 
least squares regression fit, while the dashed line is the result of a Tukey- 
type running median smoother (smooth in S-plus). A clear, almost linear 
increase is observed, suggesting that the spatial similarity of lung cancer 
cases is increasing over the 21-year time period. On the other hand, the 
posterior medians for r t plotted versus t in Figure 8.14 are all quite near 
the prior mean of 100 (again suggesting very little excess heterogeneity) 
and provide less of an indication of trend. What trend there is appears 
to be downward, suggesting that heterogeneity is also increasing over time 
(recall that r t is the precision in a mean-zero prior for 

Since under our model the expected number of deaths for a given sub¬ 
group in county i during year t is we have that the (inter- 
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Figure 8.13 Estimated posterior medians for \ t versus t, full model. 


nally standardized) expected death rate per thousand is 1000;/ exp(ft,jkt)- 
The first row of Figure 8.15 maps point estimates of these fitted rates for 
nonwhite females during the first (1968), middle (1978), and last (1988) 
years in our dataset. These estimates are obtained by plugging in the es¬ 
timated posterior medians for the pijkt parameters calculated from the 
output of the Gibbs sampler. The rates are greyscale-coded from lowest 
(white) to highest (black) into seven intervals: less than .08, .08 to .13, .13 
to .18, .18 to .23, .23 to .28, .28 to .33, and greater than .33. The second row 
of the figure shows estimates of the variability in these rates (as measured 
by the interquartile range) for the same subgroup during these three years. 
These rates are also greyscale-coded into seven intervals: less than .01, .01 
to .02, .02 to .03, .03 to .04, .04 to .05, .05 to .06, and greater than .06. 

Figure 8.15 reveals several interesting trends. Lung cancer death rates are 
increasing over time, as indicated by the gradual darkening of the counties 
in the figure's first row. But their variability is also increasing somewhat, as 
we would expect given our Poisson likelihood. This variability is smallest for 
high-population counties, such as those containing the cities of Cleveland 
(northern border, third from the right), Toledo (northern border, third 
from the left), and Cincinnati (southwestern corner). Lung cancer rates are 
high in these industrialized areas, but there is also a pattern of generally 
increasing rates as we move from west to east across the state for a given 
year. One possible explanation for this is a lower level of smoking among 
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Figure 8.14 Estimated posterior medians for r» versus t, full model. 


persons living in the predominantly agricultural west, as compared to those 
in the more mining and manufacturing-oriented east. Finally, we note that 
while our conclusions of increasing clustering and increasing heterogeneity 
over time may have at first glance seemed contradictory, they are confirmed 
by the fitted death rate maps. We see increasing evidence of clustering 
among the high rate counties, but with the higher rates increasing and the 
lower rates remaining low (i.e., increasing heterogeneity statewide). Again, 
since the higher rates tend to emerge in the poorer, more mountainous 
eastern counties, we have an indication of decreasing environmental equity 
over the last few decades. 

Regarding the possible impact of the Fernald facility. Figure 8.16 shows 
the fitted lung cancer death rates per 1000 population by year for white 
males in Hamilton county (which contains the facility), as well as Butler, 
Warren, and Clermont counties (which are adjacent to Hamilton). The 
statewide white male death rate is also plotted by year for comparison. (We 
focus on white male mortality since this is the demographic group most 
likely to have been affected at the facility during the potential exposure 
window in the 1950s.) We observe substantially elevated rates of lung cancer 
in Hamilton county. Rates in the adjacent counties are similar in magnitude 
to the statewide rate, but do seem to be increasing a bit more rapidly. In 
addition, the adjacent county with the highest rates overall is Butler, which 
is immediately north of Hamilton and second-closest to Fernald. 
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Figure 8.15 Posterior median and interquartile range (IQR) by county andyear, 
nonwhite female lung cancer death rate per 1000 population (see text for grey scale 
key). 


Model Validation, comparison, and selection 

The results of the previous subsection suggest that our full model (8.7) 
is overfitting the data, especially with respect to heterogeneity over time. 
As such, we now contemplate several reduced models, judging their fit 
relative to the full model using the expected predictive deviance (EPD) 
score (6.32). These models and their scores are shown in Table 8.5, along 
with the partitioning of the EPD into its likelihood ratio statistic (LRS) 
and predictive variability penalty (PEN) components. The first row of the 
table shows the full model and its scores; the second and third rows show 
scores for the two models that eliminate the entire collection of spatial 
and heterogeneity effects, respectively. Not surprisingly, the latter change 
leads to no significant increase in the overall score, but the former change 
also seems to have only a small impact (the EPD scores for these three 
models are essentially equivalent to the order of accuracy in our MCMC 
calculations). Apparently, despite being centered around a common mean 
Kg a prion, the ! parameters are able to adapt their posterior levels to 
account for the increasing cancer rate over time. 
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Figure 8.16 Estimated white male lung cancer death rates per 1000 by year, 
statewide a\>erage and counties near the Femald Materials Processing Center. 


The next row of Table 8.5 considers the model 


pijkt — £jk T T &i j (8.11) 

which eliminates the excess heterogeneity and space-time interaction ef¬ 
fects, but still allows a general form for the overall time trend, subject to 


model for p ljk t 

LRS 

PEN 

EPD 

e jk + ef ] + # 

5374.87 

5806.07 

11180.94 

£ jk + 0^ 

4941.32 

6017.06 

10958.39 

£ jk + 4>^ 

5180.25 

5808.73 

10988.98 

£jk + St T- </> s - 

7274.70 

5725.19 

12999.89 

ejk + 7 1 

9444.20 

4756.31 

14200.50 


Table 8.5 Likelihood ratio statistics (LRS), predictive variability penalties (PEN), 
and overall expected predicted deviance (EPD) scores, spatio-temporal Ohio lung 
cancer models. 
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an identifiability constraint such as = 0. We implemented this con¬ 

straint numerically at the end of each iteration by recentering the sampled 
S t values around 0 and also adjusting the (f> i values accordingly. The prior 
for the of is a similar, simplified version of that in (8.8) above, namely 
0i ~ CAR(X). Despite the smaller penalty associated with this more par¬ 
simonious model, the increase in the LRS leads to a larger (i.e. poorer) 
overall EPD score. This increase in score over the first three models is “sig¬ 
nificant" in the sense that it reoccurred in five other, independent MCMC 
runs. 

Finally, the last row of the table eliminates the spatial aspect of the model 
altogether, and specifies a linear form for the temporal trend. The resulting 
model has no random effects and only 5 parameters in total. Again we see 
that the decrease in the penalty term fails to offset the reduction in quality 
of fit. resulting in a further significant increase in the overall EPD score. 
Of course, many other spatio-temporal models might well be considered, 
such as other functional forms or perhaps an AR(1) structure for the 8 1 , 
multiplicative forms for the spatio-temporal interaction, and so on. 


8.3.5 Discussion 

With regard to the impact of the Fernald plant, our results are consistent 
with those obtained by Devine (1992) using a Gaussian spatial-only model. 
ITowever. they are difficult to interpret due to two confounding factors. 
First. Hamilton county is also home to Cincinnati, a large urban area. We 
have already seen in Figure 8.15 that elevated rates are often present in 
such areas. Second, our model does not incorporate information on smok¬ 
ing prevalence, the most important risk factor for lung cancer. While we 
currently have no county- or year-specific information on these two factors, 
they could conceivably be accounted for using census information (say, 
using population density and the number of cigarettes sold per capita, re¬ 
spectively). 

The difficulty in assessing the impact of Fernald points to two larger 
issues in the investigation of environmental equity. First, it illustrates the 
problem of determining "clusters" of disease and their "significance." In 
our dataset. Fernald was known in advance to be a possible source of con¬ 
tamination, so searching for high cancer rates in its vicinity was justified. 
However, individuals often scan maps like ours looking for "hot spots," and 
subsequently launch investigations to discover the "cause" of the unusually 
high rates. Of course, "data-snooping" is bound to uncover such hot spots, 
many of which will have no practical significance. Still, the detection of dis¬ 
ease clusters in the absence of prior information is a legitimate goal, and one 
which has a long history in the statistical literature (see e.g. Diggle, 1983, 
and Waller et al., 1994). Here again, the full propagation of uncertainty 
throughout all stages of the analysis (and associated smoothing of outlying 
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Figure 8.17 Census block groups arid 10-km windrose near the FMPC site, with 
1990 population density by block group and 1980 structure density by cell (both 
in counts per km 2 )■ 


rates) afforded by the Bayesian approach suggests it as most appropriate 
for attacking this difficult problem. 

A second general issue raised by Fernald is that of misalignment be¬ 
tween the three data sources involved: exposure, disease incidence, and 
socio-demographic covariates. Exposures are typically available either by 
geographic regions within a particular assessment domain, or perhaps at 
certain precise geographic coordinates - say, corresponding to monitoring 
stations or a reported toxic release. Disease incidence is instead typically 
available only as summary counts or rates over geographic regions (like 
our county-wide Ohio cancer rates). Socio-demographic covariates are also 
usually available as summary counts, but they may be defined over differ¬ 
ent geographic regions (e.g., census tracts or zip codes). Computer-based 
geographic information systems (GIS), software programs that blend color 
mapping with modern database systems, are extremely effective tools for 
handling data misalignment, but they serve only as descriptive tools for 
summarizing the data, and do not at present allow statistical inference. 

The GIS display shown in Figure 8.17 illustrates this issue in the context 
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of our Fernald case study, displaying information on two geographically 
misaligned regional grids. The first is an exposure "windrose," consisting 
of 10 concentric circular bands at 1-kilometer radial increments divided into 
16 compass sectors (N, NNW, NW, WNW, W, etc.). Using a risk model 
that accounts for air and groundwater flows near the plant, Killough et al. 
(1996) provide cumulative radon exposure estimates for each of the 160 cells 
in the windrose by gender and quinquennial age group (0-4, 5-9, . . . , 35-39) 
and quinquennial time period (1950-1954, ..., 1985-1989). If we knew the 
number of persons at risk in each of these cell-gender-age-time categories, 
we could simply multiply them by the corresponding risk estimates and 
sum the results to produce expected numbers of observed lung cancers due 
to plant exposure. 

Unfortunately, the source of information for persons at risk is what gives 
rise to the second regional grid shown in Figure 8.17. These irregularly 
shaped regions are the boundaries of 39 U.S. Census Bureau block groups, 
for which 1990 population counts are known. Population density (persons 
per km 2 is shown using greyscale shading in the figure. The intersection 
of the two non-nested zonation systems results in 389 regions called atoms, 
which can be aggregated appropriately to form either cells or block groups. 

In order to reallocate the population counts from the block group grid to 
the exposure cell grid, we may take advantage of a covariate available on 
the latter. This is the cell-specific residential structures count, obtained 
by overlaying the windrose onto U.S. Geological Survey (USGS) aerial 
maps. Structure density (also in counts per km 2 ) is shown using hatch¬ 
ing in Figure 8.17. Using this covariate, Bayesian model-based allocation 
of the population counts from the block groups to the atoms (and sub¬ 
sequently reaggregated to the cells) should perform better than a simple 
area-based interpolation, as well as allow full posterior inference (rather 
than merely point estimates). 

Figure 8.18 shows the resulting estimated population density (posterior 
medians) by cell. Without going into detail here, the approach uses con¬ 
ditionally independent Poisson-multinomial models that incorporate the 
structure covariate as well as regional areas and the known constraints on 
the atom-level structure counts in a single cell (or the atom-level popula¬ 
tion counts in a single block group). Spatial correlation is modeled using 
a CAR prior structure, and age- and gender-specific allocations are also 
available. Of special concern are the spatial "edge zones," atoms result¬ 
ing from block groups extending outside the windrose (and thus for which 
no structure covariate information is available). See Mugglin et al. (2000) 
for modeling and implement ational details. See also Wolpert and Ickstadt 
(1998) for more on Bayesian methods for handling data misalignment, and 
Best et al. (2000) for a related case study in childhood respiratory illness. 
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APPENDIX A 


Distributional catalog 


This appendix provides a summary of the statistical distributions used in 
this book. For each distribution, we list its abbreviation, its support, con¬ 
straints on its parameters, its density function, and its mean, variance, and 
other moments as appropriate. We also give additional information helpful 
in implementing a Bayes or EB analysis (e.g., relation to a particular sam¬ 
pling model or conjugate prior). We denote density functions generically 
by the letter p, and the random variable for the distribution in question by 
a Roman letter: x for scalars, x for vectors, and V for matrices. Parame¬ 
ter values are assigned Greek letters. The reader should keep in mind that 
many of these distributions are most commonly used as prior distributions, 
so that the random variable would itself be denoted as a Greek character. 

In modern applied Bayesian work, a key issue is the generation of random 
draws from the distributions in this catalog. Since whole textbooks (De- 
vroye, 1986; Ripley, 1987; Gentle, 1998) have been devoted to this subject, 
we make no attempt at a comprehensive treatment here, but only provide a 
few remarks and pointers. First, many modern Bayesian software packages 
(such as BUGS and others described in Appendix Section C.3) essentially 
"hide" the requisite sampling from the user altogether. High-level (but still 
general purpose) programming environments (such as S-Plus, Gauss, or 
MOW) feature functions for generating samples from most of the stan¬ 
dard distributions below. Lower-level languages (such as Fortran or C) 
often include only a £/m'/orm(0, 1) generator, since samples from all other 
distributions can be built from these. In what follows, we provide a lim¬ 
ited number of generation hints, and refer the reader to the aforementioned 
textbooks for full detail. 


A.l Discrete 

A.1.1 Univariate 


Binomial: A' ~ Bin(n, 9), x — 0,1,2,..., n, 0 < 0 < 1, n any positive 
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integer, and 

p(x\n,())= -0) n ~* , 

where 

/ n\ n\ 

\x J x\(n — x)! 

E(X) = n6 and Var{X ) = nO{ 1 - 9). 

The binomial is commonly assumed in problems involving a series of 
independent, success/failure trials all having probability of success 6 
(Bernoulli trials): X is the number of successes in n such trials. The 
binomial is also the appropriate model for the number of "good" units 
obtained in n independent draws with replacement from a finite popula¬ 
tion of size N having G good units, and where 9 = G/N , the proportion 
of good units. 

Poisson: X ~ Po(0), x = 0,1,2,..., 6 > 0, and 


p(x\9) 


e~ e 9 x 

T I 


E(X) = 9 and Var(X) = 9. 

The Poisson is commonly assumed in problems involving counts of "rare 
events" occurring in a given time period, and for other discrete data 
settings having countable support. It is also the limiting distribution for 
the binomial when n goes to infinity and the expected number of events 
converges to 9. 

• Negative Binomial: X ~ NegBin(r,9), x = 0, 1,2.0 < 9 < 1, r 

any positive integer, and 

9 r (l-0) x . 

E(X) = r(l - 9)/9 and Var(X) = r(l - 9)/9 2 . 

The name "negative binomial" is somewhat misleading; "inverse bino¬ 
mial" might be a better name. This is because it arises from a series 
of Bernoulli trials where the number of successes, rather than the total 
number of trials, is fixed at the outset. In this case, X is the number 
of failures preceding the r th success. This distribution is also sometimes 
used to model countably infinite random variables having variance sub¬ 
stantially larger than the mean (so that the Poisson model would be 
inappropriate). In fact, it is the marginal distribution of a Poisson ran¬ 
dom variable whose rate 9 follows a gamma distribution (see Section A.2 
below). 

• Geometric: X ~ Geom(9) = NegBin{ 1,9), x = 0,1, 2,..., 0 < 9 < 1, 
and 

p(x\9) = 0(1-0)* . 


p(x\r, 9) = 


x + r 
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E(X) = (1 - eye and Var(X) = (1 - 0)/0 2 . 

The geometric is the special case of the negative binomial having r = 1; 
it is the number of failures preceding the first success in a series of 
Bernoulli trials. 


A. 1.2 Multivariate 


9 Multinomial: X ~ Mult(n,9) for X = (rq ,...,Xk)', where Xi € 
{0 . 1 , 2 ,..., nj and ^* : =| Xi = n, 9 = (9\,..., 6 k)' where 0 < 0; < 1 

and Ya =i 0» = 1» and 


p(x|n,0) 



n»r. 


E(X{) = n9i, Var(Xi) = n9i( 1 — 0,), and Cov(Xi,Xj) = —n9i6j. 

The multinomial is the multivariate generalization of the binomial: note 
that for k = 2, Mult(n, 0 = Bin(n, $ with 0 = 9\ = 1 - 62 - For 

a population having k >2 mutually exclusive and exhaustive classes, 
Xi is the number of elements of class i that would be obtained in n 
independent draws with replacement from the population. Note well that 
this is really only (T-l)-dimensional distribution, since Xk = n— x i 

and 0* = 1 — Yli=l 9i ■ 


A.2 Continuous 

A.2.1 Univariate 

• Beta: X ~ Beta(a, f3), x 6 [0,1], a > 0, f3 > 0, and 


p( x \ a > P) = 


r (a + (3) 


— 1 


(1 - x} 


J-l 


T(a)T(/i) 

where T() denotes the gamma function, defined by the integral equation 


riOO 

T(a)= / y a ~ l e~ y dy , a > 0 . 

Jo 


Using integration by parts, we have that T(a) = (a — l)T(a — 1), so 
since T(l) = 1, we have that for integer a, 

T(a) = (a - 1)! . 

One can also show that T(l/2) = ^/tt. 

For brevity, the pdf is sometimes written as 

1 


„a — 1 


B(a,pY 


(1 - x) 


9-1 
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where B(-, •) denotes the beta function, 


B(a,ft) 


n<*w) 

r(« + /?) ' 


Returning to the distribution itself, we have E(X ) = a/(a + ft) and 
Var(X) = a/?/[(a + /l) 2 (a + ft + 1)]; it is also true that E(X 2 ) = a(a + 
1)/[(« + /?)(a + ft + 1)]. The beta is a common and flexible distribution 
for continuous random variables defined on the unit interval (so that 
rescaled and/or recentered versions may apply to any continuous variable 
having finite range). For a < 1, the distribution has an infinite peak at 0; 
for ft < 1, the distribution has an infinite peak at 1. For a — ft — 1, the 
distribution is flat (see the next bullet point). For a and ft both greater 
than 1, the distribution is concave down, and increasingly concentrated 
around its mean as a + ft increases. The beta is the conjugate prior for 
the binomial likelihood; for an example, see Subsection 2.3.4. 

Uniform: X ~ Unif(0 L ,0u), x £ [0l,0u], -oo < 0 L < 9 V < oo, and 


p(x\6 L ,0u) = 


1 

Ou - Ol 


E(X) = (0 L +0 V )/2 and Var(X) = (Ou-0 L ) 2 / 12. Clearly Urn f (0,1) = 
Beta( 1,1). 

• Normal (or Gaussian): X ~ N(n,<r 2 ), x £ 3?, -oo < /i < oo, a 2 > 0, 
and 

p(x\n,a 2 ) = —i^exp 
Liter 

E(X) = p. and Var(X) = cr 2 . The normal is the single most common 
distribution in statistics, due in large measure to the Central Limit The¬ 
orem. It is symmetric about p, which is not only its mean, but its median 
and mode as well. 

A 7V(0, 1) variate Z can be generated as yZ-2 log([/'i )co.s(27r(7 2 ), where 
fft and U -2 are indep endent Unif( 0, 1) random variates; in fact, the 
quantity \J—‘2 log( b\ ).s in(‘2itU 2 ) produces a second, independent /V(0,1) 
variate. This is the celebrated Box -Midler (1958) method. A N(p,<r 2 ) 
variate can then be generated as p + crZ. 

• Double Exponential (or LapIacefX ~ DE(p, a 2 ), x £ 3ff, -00 < p < 
00 , a 2 > 0, and 



p(x\p,a 2 ) = exp 

E(X) = p and Var(X) = 2cr 2 . 

Like the normal, this distribution is symmetric and unimodal, but has 
heavier tails and a somewhat different shape, being strictly concave up 
on both sides of p. 
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• Logistic: X ~ Logistic (p, a 2 ), x £ 5ft, —oo < p < oo, a 2 > 0, and 


p(z|/i,<r 2 ) 


exp(-^) 

<r[ l + exp(-^)] J 


E(X) = p and Lar(X) = (7 t 2 /3)<t 2 . 

The logistic is another symmetric and unimodal distribution, more sim¬ 
ilar to the normal in appearance than the double exponential, but with 
even heavier tails. 


• t (or Student's t): X ~ t{v, p, a 2 ), x £ 5ft, v > 0, — oo < p < oo, <7 2 > 0, 
and 


p(x\n,a 2 ,p) = 


r[(^ + i)/ 2 ] 

CT^/irKY(l>/2) 



X — fj, 


2 -|-(V + l)/2 


i?(X) = p (if > 1) and Var(X) = va 2 /{v — 2) (if v > 2). The 
parameter v is referred to as the degrees of freedom and is usually taken 
to be a positive integer, though the distribution is proper for any positive 
real number v. The t is a common heavy-tailed (but still symmetric and 
unimodal) alternative to the normal distribution. The leading term in 
the pdf can be rewritten 


T[(i/+l)/2] = 1 

(r^/vfY{v/2) ay/vB{\,%) 


where B(., .) again denotes the beta function. 

• Cauchy: X ~ Cau(fi,o 2 ) = t(l,^,<r 2 ), x £ 9t, -oo < \x < oo, <r 2 > 0, 
and 


p{x\fi,cr 2 ) 


(TIT 


1 


E(X) and Var{X) do not exist, though p is the median of this distri¬ 
bution. This special case of the t distribution has the heaviest possible 
tails and provides a wealth of counterexamples in probability theory. 


• Gamma: X ~ G(a,/?), x > 0, a > 0, (3 > 0, and 


x a-l e -x/f3 

T(a)/? a . ' 

E(X) = a/3 and Var(X) = a(3 2 . Note also that if X ~ G(a,/3), then 
Y = cX ~ G(a,c/3). 

The gamma is a flexible family for continuous random variables defined 
on the positive real line. For a < 1, the distribution has an infinite peak 
at 0 and is strictly decreasing. For a = 1, the distribution intersects the 
vertical axis at 1//3, and again is strictly decreasing (see the Exponential 
distribution below). For a > 1, the distribution starts at the origin. 


p(x\a,/3) 
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increases for a time and then decreases; its appearance is roughly normal 
for large a. The gamma is the conjugate prior distribution for a Poisson 
rate parameter 9. 

• Exponential: .A ~ Expo(j3 ) = G(l,/3), x > 0, [3 > 0, and 

p(x\(3) = . 

E(X) = (3 and Var(X) = f3 2 . 

• Chi-square: X ~ y 2 (jy) = G{vj 2, 2), x > 0, v > 0, and 

x v/2-l e -r/2 

r{xW) = wm* ■ 

E{X) = v and Var(X) = 2v, As with the t distribution, the parameter 
u is referred to as the degrees of freedom and is usually taken to be a 
positive integer, though a proper distribution results for any positive 
real number j/. 


• Inverse Gamma: X ~ IG(a, /?), x > 0, a > 0, (3 > 0, and 


p(x|a,/?) 


r(a)/?"a ; a + 1 


E'jX) = l/[/?(a— 1)] (provided a> 1) and VafX) = l/[/? 2 (o — l) 2 (cv — 
2)] (provided a > 2). Note also that if A ~ IG(a,(3), then Y = cA' ~ 
IG(a,f3/c). 

A better name for the inverse gamma might be the reciprocal gamma, 
since X = l/Y where Y ~ G’(q', 0) (or in distributional shorthand, 
IG(ot,/3) = \/G{a,(3)). 

Despite its poorly behaved moments and somewhat odd, heavy-tailed ap¬ 
pearance, the inverse gamma is very commonly used in Bayesian statis¬ 
tics as the conjugate prior for a variance parameter a 2 arising in a normal 
likelihood function. Choosing a and /? appropriately for such a prior can 
be aided by solving the above equations for p = E(X) and r 2 = Var(X) 
for a and (3. This results in 


a = ( p/r ) 2 + 2 and (3 = 


1 


{p/r)- + 1 




Setting the prior mean and standard deviation both equal to p (a rea¬ 
sonably vague specification) thus produces a = 3 and /3 = 1/(2 p). 
Inverse Gaussian (or Wcild):X~ InvGau(p,\), x > 0, p > 0, A > 0, 
and 


p{x\p,\) 


E(X) = p and Var(X) = p 3 / A. 


/ A 

X(x — p) 2 

V2«3 ex P 

: 3 /A. 

2 p 2 x 
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The inverse Gaussian has densities that resemble those of the gamma 
distribution. Despite its somewhat formidable form, all its positive and 
negative moments exist. 


A.2.2 Multivariate 


• Dirichlet: X ~ D(a), X = (n,..., x k y where 0 < x, < 1 and 
SLi = 1, « = , ctk)' where ck* > 0, and 


p(x|a) 


r(tto) 

nLiUo,) 


IK'" 


E(Xi) — oii/ao, Var(Xi) = [q i(a 0 - n,)]/[a^(a 0 + 1)], and finally 
Cov(Xi,Xj) = -(ociaj)/[otl(a 0 + 1)], where q 0 = Ya=\ Q i- 
The Dirichlet is the multivariate generalization of the beta: note that 
for k = 2, D(a) — Beta(oti, a 2 )- Note that this is really only (k - 1)- 
dimensional distribution, since Xk — 1 - J2i=i x i- The Dirichlet is the 
conjugate prior for the multinomial likelihood. It also forms the foun¬ 
dation for the Dirichlet process prior, the basis of most nonparametric 
Bayesian inference (see Section 2.5). 

• Multivariate Normal (or Multinormal, or Multivariate Gaussian): 

X ~ N k (p,E), x G p = (pi,...,p k )' where — oo < Pi < oo, E 
a k x. k positive definite matrix, and 


p(x\p,S) 


X7 _1 1 1/2 

(2^)W eXP 


’(x- p) 


where \E~ l 1 denotes the determinant of 27 _1 . E(X) = p (i.e., E(Xi ) = 
Pi for all i ), and Var(X) = E (i.e., Var(Xi) = an and Cov(Xi, Xj) = 
aij where E = (crij)). The multivariate normal forms the basis of the 
likelihood in most common linear models, and also serves as the conju¬ 
gate prior for mean and regression parameters in such likelihoods. 

A N k (0, 1) variate z can be generated simply as a vector of k indepen¬ 
dent, univariate N(0, 1) random variates. For general covariance ma¬ 
trices E, we first factor the matrix as E = LL', where L is a lower- 
triangular matrix (this is often called the Cholesky factorization; see for 
example Thisted, 1988, pp.81-83). A N k (p, E) variate z can then be 
generated as p + Lz. 

• Multivariate t (or Multi-t): X ~ t k (v.u. 27). x <= Sf* v '> 0 ##. = 
(pi,.. ,,p k y where —oo < pi < oo, E b. k x k positive definite matrix, 
and p{*\v,p, E) is 


IT 1 | 1/a r[(i/ + fe)/2] 

(vit) k l 2 T(v/2) 


1 + -(x - p)'E x (x-/r) 
v 


~(v+k)/2 
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E(X) = n (if v > 1), and Ear(X) = vE/{v - 2) (if a > 2). The 
multivariate t provides a heavy-tailed alternative to the multivariate 
normal while still accounting for coiTelation among the elements of x 
Typically, £ is called a scale matrix, and is approximately equal to the 
variance matrix of X for large v. 

• Wishart: V ~ W (ft, v), Y a k x k symmetric and positive definite 
matrix, fl a k xk symmetric and positive definite parameter matrix 
v > 0, and 


IV |( 1 '—t )/ 2 

exp 




provided the shape parameter (or "degrees of freedom") y > k. Here 
the proportionality constant c takes the awkward form 


2(i'k)/2 7r k(k-l)/4 JT p 

3 = 1 


13+1-j 



E(Vij) = vilij, Var(Vij ) = + fi uiljj), and finally Cov(Vij,V k i) = 

The Wishart is a multivariate generalization of the gamma, originally 
derived as the sampling distribution of the sum of squares and crossprod¬ 
ucts matrix., ^”_ 1 (X ! -X)(Xi-X) / , where X t - AT(/i. V). In Bayesian 
analysis, just as the reciprocal of the gamma (the inverse gamma) is of¬ 
ten used as the conjugate prior for a variance parameter in a normal 
likelihood, the reciprocal of the Wishart (the inverse Wishart ) is often 
used as the conjugate prior for a variance-covariance matrix X in a mul¬ 
tivariate normal likelihood (see Example 5.6). 


A random draw from a Wishart with v an integer can be obtained using 
the idea in the preceding paragraph: If ,..., x„ are independent draws 
from a N k ( 0 , fl) distribution, then v = YlUi x * x < is a u) random 
variate. Non-integral v requires the general algorithm originally given by 
Odell and Feiveson (1966); see also the textbook treatment by Gentle 
(1998, p.107), or the more specifically Bayesian treatment by Gelfand et 
al. (1990). 
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APPENDIX B 


Decision theory 


B.l Introduction 


As discussed in Chapters 1, 3, and 4, the Bayesian approach with low- 
information priors strikes an effective balance between frequentist robust¬ 
ness and Bayesian efficiency. It occupies a middle ground between the two 
and, when properly tuned, can generate procedures with excellent frequen¬ 
tist properties, often "beating the frequentist approach at its own game." 
As an added benefit, the Bayesian approach more easily structures compli¬ 
cated models and inferential goals. 

To make comparisons among approaches, we need a method of keeping 
score, commonly referred to as a loss structure. Loss structures may be 
explicit or implicit; here we discuss the structure adopted in a decision- 
theoretic framework. We outline the general decision framework, highlight¬ 
ing those features that suit the present purpose. We also provide examples 
of how the theory can help the practitioner. Textbooks such as Ferguson 
(1967) and DeGroot (1970) and the recent essay by Brown (2000) provide 
a more comprehensive treatment of the theory. 

Setting up the general decision problem requires a prior distribution, a 
sampling distribution, a class of allowable actions and decision rules, and a 
loss function. Specific loss function forms correspond to point estimation, 
interval estimation, and hypothesis testing. We use the notation 


prior distribution: 
sampling distribution: 
allowable actions: 
decision rules: 
loss function: 


G(0), dee 
f(xjd) 
a eA 
deV:X 
1(6 , a) . 


(B.l) 


The loss function /( 9 a) computes the loss incurred when 6 is the true state 
of nature and we take action a. Thus for point estimation of a parameter 
9. we might use squared error loss (SEL), 


1(6, a) = (6-a) 2 
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or weighted squared error loss (WSEL), 


1 (0, a) = w(9)(6 — a) 2 , 


or absolute error loss, 

1 (6, a) = \9 — a 

or, for discrete parameter spaces, 0-1 loss. 



if 6 = a 
if 9 rf a 


(B.2) 


The decision rule d maps the observed data x into an action a; in the case 
of point estimation, this action is a proposed value for 9 (e.g., d(x ) =x 
The Bayesian outlook on the problem of selecting a decision rule is as 
follows. In light of the data x, our opinion as to the state of nature is 
summarized by the posterior distribution of 9, which is given by Bayes' 
Theorem as 


= f(x\9)dG(9) 
m G (x) 


where 



f(x\u)dG(u) . 


The Bayes rule minimizes the posterior risk. 


p(G, d(x)) = E t \n\l(e, d.(x))} = j 1(9, d(x))dG(9\x) . (B.3) 

Note that posterior risk is a single number regardless of the dimension of 
9, so choosing d to minimize p(G, d(x)) is well defined. 


B.1.1 Risk and admissibility 

In the frequentist approach, we have available neither a prior nor a posterior 
distribution. For a decision rule d, define its frequentist risk (or simply risk ) 
for a given true value of 9 as 

R(0,d) = £x|0[J(0> £ *( x ))] = J /(0,d(x))/(x|0)dx , (B.4) 

the average loss, integrated over the distribution of X conditional on 0. 
Note that frequentist risk is a. function of 9 not a single number like the 
posterior risk (B.3). 

Example B.l In the interval estimation setting of Example 1.1, consider 
the loss function 

fO if 9 € 8(x 
| 1 if 9 <£ 6(x) 
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Then the risk function is 


R(8, d) — -Ex| 0 >o -2 [1(0; <5(X))] 

= Px\b,AQ 6(X>] 

= .05, 

which is constant over all possible values of 0 and cr 2 . ■ 

This controlled level of risk across all parameter values is one of the main 
selling points of frequentist confidence intervals. Indeed, plotting R versus 
0 for various candidate rules can be very informative. For example, if d\ 
and d ‘2 are two rules such that 

R(Q, d x ) #R(9.d 2 ) for all 0, 

with strict inequality for at least one value of 9. then under the given loss 
function we would never choose d 2 , since its risk is never smaller that d\ ’s 
and can be larger. In such a case, the rule d 2 is said to be inadmissible, and 
dominated by d\. Note that di may itself be inadmissible, since there may 
be yet another rule which uniformly beats d 2 . If no such rule exists, r/ x is 
called admissible. 

Admissibility is a sensible and time-honored criterion for comparing de¬ 
cision rules, but its utility in practice is rather limited. While inadmissible 
rules often need not be considered further, for most problems there will be 
many admissible rules, creating the problem of which one to select. More¬ 
over, admissibility is not a guarantee of sensible performance; the example 
in Subsection 4.2.1 provides an illustration. Therefore, additional criteria 
are required to select a frequentist rule. We consider the three most impor¬ 
tant such criteria in turn. 

B.1.2 Unbiased rules 

Unbiasedness is a popular method for reducing the number of candidate 
decision rules and allowing selection of a "best" rule within the reduced 
class. A decision rule d(x) is unbiased if 

Ex\e[l( 8 ',d(x))\ > E x \ e [l{9,d(x ))] for all 9 and 9' . (B.5) 

Under squared error loss, (B.5) is equivalent to requiring E x \g[d(x)] = 9. For 
interval estimation, unbiasedness requires that the interval have a greater 
chance of covering the true parameter than any other parameter. For hy¬ 
pothesis testing, unbiasedness is equivalent to the statistical power being 
greater under an alternative hypothesis than under the null. 

Though unbiasedness has the attractive property of reducing the number 
of candidate rules and allowing the frequentist to find an optimal rule, it 
is often a very high price to pay - even for frequentist evaluations. Indeed, 
a principal thesis of this book is that a little bias can go a long way in 
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improving frequentist performance. In Section B.2 below we provide the 
basic example based on squared error loss. Here we present three estima¬ 
tion examples of how unbiasedness can work against logical and effective 
inference. 

Example B.2 : Estimating P(no events) Ferguson (1967) provides 
the following compelling example of a problem with unbiased estimates. 
Assume a Poisson distribution and consider the goal of estimating the prob¬ 
ability that there will be no events in a time period of length 2 1 based on 
the observed number of events, X, in a time period of length t. Therefore, 
with A the Poisson parameter we want to estimate e~ 2tx , based on X which 
is distributed Poisson(f A). The MLE of this probability is e~ 2X , while the 
best (indeed, the only) unbiased estimate is ( —l) x . This latter result is 
derived by matching terms in two convergent power series. 

This unbiased estimate is patently absurd; it is either +1 or -1 depending 
on whether an even or an odd number of events has occurred in the first t 
time units. Using almost any criterion other than unbiasedness, the MLE 
is preferred. For example, it will have a substantially smaller MSE. ■ 

Example B.3: Constrained parameter space. In many models the 
unbiased estimate can he outside the allowable parameter space. A common 
example is the components of variance models (i.e., the Type II ANOVA 
model), where 

1?' j — [l T bi -f- £ij , i — 1 j • • -) k i j 1, . . ., Tli , 

where p G 3?, 6, ~ d A(0,r 2 ), and tij yV(0,<r 2 ). The unbiased estimate 
of r 2 , computed using the within and between mean squared errors, can 
sometimes take negative values (as can the MLE). Clearly, performance of 
these estimates is improved by constraining them to be nonnegative. ■ 

Example B.4: Dependence on the stopping rule. As with all fre¬ 
quentist evaluations, unbiasedness requires attention to the sampling plan 
and, if applicable, to the stopping rule used in collecting the data. Consider 
again Example 1.2, where now we wish to estimate the unknown success 
probability based on x observed successes in n trials. This information 
unambiguously determines the MLE as x/n, but the unbiased estimate 
depends on how the data were obtained. If n is fixed, then the MLE is 
unbiased. But if the sampling were "inverse" with data gathered until 
failures were obtained, then the number of observed successes, x, is nega¬ 
tive binomial, and the unbiased estimate is , - . = . If both n and x 

are random, it may be impossible to identify an unbiased estimate. ■ 


B.1.3 Bayes mles 

While many frequentists would not even admit to the existence of a sub¬ 
jective prior distribution G(9), a possibly attractive frequentist approach 
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to choosing a best decision rule does depend on the formal use of a prior. 
For a prior G, define the Bayes risk as 

r(G, d) = E e E xlt l(ff, d(x)) = E e R(0, d) , (B.6) 

the expected frequentist risk with respect to the chosen prior G. Bayes risk 
is alternatively known as empirical Bayes risk because it averages over the 
variability in both 0 and the data. Reversing the order of integration in 
(B.6), we obtain the alternate computational form 

r(G,d ) = E x E m l(0,d(x)) = E x p(G,d(x)), 

the expected posterior risk with respect to the marginal distribution of 
the data X. Since this is the posterior loss one expects before having seen 
the data, the Bayes risk is sometimes referred to as the preposterior risk. 
Thus r(G, d) is directly relevant for Bayesian experimental design (see Sec¬ 
tion 4.5). 

Since r(G, d )is a scalar quantity, we can choose the rule f ] G that mini¬ 
mizes the Bayes risk, i.e., 

d G (x) = arg min r(G,d) . (B.7) 

This minimizing rule is called the Bayes rule. This name is somewhat con¬ 
fusing, since a subjective Bayesian would not average over the data (as 
r(G, d) does) when choosing a decision rule, but instead choose one that 
minimized the posterior risk (B.3) given the observed data. Fortunately, 
it turns out that these two operations are virtually equivalent: under very 
broad conditions, minimizing the Bayes risk (B.6) is equivalent to mini¬ 
mizing the posterior risk (B.3) for all x such that me(x) > 0. For further 
discussion of this result and related references, see Berger (1985, p.159). 

Example B.5: Point estimation. Under SEL the Bayes rule is found 
by minimizing the posterior risk, 

P(G, a) = J (0 - afg(0\x)d0 . 

Taking the derivative with respect to a, we have 

7 ^[p(G, a)] = j 2(0- a)(-l)g(0\x)d0 . 

Setting this expression equal to zero and solving for a, we obtain 

a = J 0 g{0\x)d6 = E(0\x) , 

the posterior mean, as a solution. Since 

^2 [p(G, a)\ - 2 j g(0\x)d0 = 2 > 0 , 
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the second derivative test implies our solution is indeed a minimum, con¬ 
firming that the posterior mean is the Bayes estimate of 0 under SEL. 

We remark that under absolute error loss, the Bayes estimate is the 
posterior median, while under 0-1 loss, it is the posterior mode. ■ 

Example B.6: Interval estimation. Consider the loss function 

1(9, a) = I{»g a } + c x volume(a), 

where I is the indicator function of the event in curly brackets and a is a 
subset of the parameter space 0. Then the Bayes rule will be the region 
(or regions) for 0 having highest posterior density, with c controlling the 
tradeoff between the volume of a and the posterior probability of coverage. 
Subsection 2.3.2 gives a careful description of Bayesian confidence inter¬ 
vals; for our present purpose we note (and Section 4.3 demonstrates) that 
under noninformative prior distributions, these regions produce confidence 
intervals with excellent frequentist coverage and modest volume. ■ 

Example B.7: Hypothesis testing. Consider the comparison of two 
simple hypotheses. Ho '■ 9 — 9o and H i : 9 = 9i, so that we have a two- 
point parameter space 0 = \0 (h 0j}. We use the “0-/ f ” loss function 

1(9, a) = al 0 + (1 - a)li, 

where both la and 1, are nonnegative, and a £ {0,1} gives the index of the 
accepted hypothesis. Then it can be shown that the Bayes rule with respect 
to a prior on 7r = P(9 = 9 0 ) i s a likelihood ratio test; in fact, many authors 
use this formulation to prove the Neyman-Pearson lemma. This model is 
easily generalized to a countable parameter space. ■ 

B.1.4 Minimax rules 

A final (and very conservative) approach to choosing a best decision rule 
in a frequentist framework is to control the worst that can happen: that is, 
to minimize the maximum risk over the parameter space. Mathematically, 
we choose the rule d* satisfying the relation 

sup R(9, d*) = inf sup R(6, d) . 

0£@ d £T > 8£@ 

This d* may sacrifice a great deal for this control, since its risk may be 
unacceptably high in certain regions of the parameter space. The minimax 
rule is attempting to produce controlled risk behavior for all 0 and x, but 
is making no evaluation of the relative likelihood of the 0 values. 

The Bayesian approach is helpful in finding minimax rules via a theorem 
which states that a Bayes rule (or limit of Bayes rules) that has constant 
risk over the parameter space is minimax. Generally, the minimax rule is 
not unique (indeed, in Chapter 3 we find more than one), and failure to find 
a constant risk Bayes rule does not imply that the absence of a minimax 
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rule. Of course, as with all other decision rules, the minimax rule depends 
on the loss function. For the Gaussian sampling distribution, the sample 
mean is minimax (it is constant risk and the limit of Bayes rules). Finding 
minimax rules for the binomial and the exponential distributions have been 
left as exercises. 


B.2 Procedure evaluation and other unifying concepts 

B.2.1 Mean squared error 

To see the connection between the Bayesian and frequentist approaches, 
consider again the problem of estimating a parameter under squared er¬ 
ror loss (SEL). With the sample X = (A’i, Xo..... X n ). let d(X) be our 
estimate of 6. Its frequentist risk is 

£ X | S [/(M(x))] = E x \ e [(9- d(x)) 2 ] = MSE d {6) , 

the mean squared error of d given the true 0. Its posterior risk with respect 
to some prior distribution G is 

£„| X [/(M(x))] = E m [{0 - d(x)) 2 ] = MSE d , G (x ) , 

and its preposterior risk is 

Ee,x[l(9, d(x))] = E 0tX [(0 - d(x)) 2 ] = MSE d>G , 

which equals Eg[MSE d (6)\ and E X [M SE diG (x)]. We have not labeled d 
as "frequentist” or "Bayesian;" it is simply a function of the data. Its 
properties can be evaluated using any frequentist or Bayesian criteria. 


B.2.2 The variance-bias tradeoff 


We now indicate how choosing estimators having minimum MSE requires a 
tradeoff between reducing variance and bias. Assume that conditional on 9 , 
the data are i.i.d. with mean 9 and variance cr 2 {9). Let d(x) be the sample 
mean, X = — y' n . X,, and consider estimators of 9 of the form 

d c (x) = cX . (B.8) 


For 0 < c < 1, this estimator shrinks the sample mean toward 0 and has 
the following risks: 


frequentist: 
posterior: 
preposterior (Bayes): 


\9) 


+ (i -cfe 2 

n, 

Var{9 |x) + [cx - E(9 |x)] 2 , 
c 2 ^^f(l-cf {Var (0)f { E(9)] 2 }, 
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where in the last expression the expectations and variances are with respect 
to the prior distribution G. Notice that all three expressions are of the form 
"variance + bias squared." 

Focusing on the frequentist risk, notice that the first term decreases like 
— and the second term is constant in n. If 0 = 0 or is close to 0, then a c near 

n 

0 will produce small risk. More generally, for relatively small n, it will be 
advantageous to set c to a value betweeen 0 and 1 (reducing the variance at 
the expense of increasing the bias), but, as n increases, c should converge 
to 1 (for large a, bias reduction is most important). This phenomenon 
underlies virtually all of statistical modeling, where the standard way to 
reduce bias is to add terms (degrees of freedom, hence more variability) to 
the model. 

Taking c = 1 produces the usual unbiased estimate with MSE equal to 
the sampling variance. Comparing c = 1 to the general case shows that if 

nO 2 - <r 2 (0) 
n0 2 +<r 2 (0) <c - ’ 

then it will be advantageous to use the biased estimator. This relation can 
be turned around to find the interval of Os around 0 for which the estimator 
beats X. This interval has endpoints that shrink to 0 like . 

Turning to the Bayesian posterior risk, the first term also decreases like 
i, and the Bayes rule (not restricted to the current form) will be E(6\x). 
The c that minimizes the preposterior risk is 

n{Var(0) + {E{0)\ 2 } 

C ~ Cn ~ n {Var(6) + [£(0)] 2 } + E[^(0)} ' 

As with the frequentist risk, for relatively small n it is advantageous to use 
c ^ 1 and have c n increase to I as n increases. 

Similar results to the foregoing hold when an estimator (B.8) is aug¬ 
mented by an additive offset (i.e., d a c (x) = a + cX ). We have considered 
SEL and MSE for analytic simplicity; qualitatively similar conclusions and 
relations hold for a broad class of convex loss functions. 

Shrinkage estimators of the form (B.8) arise naturally in the Bayesian 
context. The foregoing discussion indicates that they can have attractive 
frequentist properties (i.e., lower MSE than the standard estimator, X). 
Under squared error loss, one wants to strike a tradeoff between variance 
and bias, and rules that effectively do so will have good properties. 

B.3 Other loss functions 

We now evaluate the Bayes rules for three loss functions that show the 
effectiveness of Bayesian structuring and how intuition must be aided by the 
rules of probability. Chapter 4 and Section 7.1 present additional examples 
of the importance of Bayesian structuring. 
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B.3.1 Generalized absolute loss 
Consider the loss function for 0 < p < 1, 

p\9 — a\ if a < 9 
(1 - p)\9 - a| if a > 9 . 

It is straightforward to show that the optimal estimate is the p th percentile 
of the posterior distribution. Therefore, p = .5 gives the posterior median. 
Other values of p are of more than theoretical interest, as illustrated below 
in Problem 3. 


B.3.2 Testing with a distance penalty 

Consider a hypothesis testing situation wherein there is a penalty for mak¬ 
ing an incorrect decision and, if the null hypothesis is rejected, an addi¬ 
tional penalty that depends on the distance of the parameter from its null 
hypothesis value. The loss function 

_ f 0 , a = 0, 9 < 0 or a = 1, 9 > 0 
\ 0 2 , a = 0,0>Oora = l,#<O 

computes this score. Assume the posterior distribution is N(p, r 2 ). Then 
R(9, a = 1) - R(9, a = 0) = r 2 [l - 2*(p/r)] + ^<f>(p/r) , 

T 

where the first term corresponds to 0-1 loss, while the second term corre¬ 
sponds to the distance adjustment. If p > 0, 0-1 loss implies a = 1, but 
with the distance adjustment, if r < 1, then for small p > 0, a = 0. This 
decision rule is non-intuitive, but best. 


B.3.3 A threshold loss function 

Through its Small Area Income and Poverty Estimates (SAIPE) project, 
the Census Bureau is improving estimated poverty counts and rates used 
in allocating Title I funds (National Research Council, 2000). Allocations 
must satisfy a "hold-harmless" condition (an area-specific limit on the re¬ 
duction in funding from one year to the next) and a threshold rule (e.g., con¬ 
centration grants kick in when estimated poverty exceeds 15%). ACS (2000) 
demonstrates that statistical uncertainty induces unintended consequences 
of a hold-harmless condition, because allocations cannot appropriately ad¬ 
just to changing poverty estimates. Similarly, statistical uncertainty induces 
unintended consequences of a threshold rule. Unintended consequences can 
be ameliorated by optimizing loss functions that reflect societal goals as¬ 
sociated with hold-harmless, threshold, and other constraints. However, 
amelioration will be limited and societal goals may be better met through 
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eligible for 

condition concentration funds? loss 


9 >T yes (9 — a) 2 

9 <T no a 2 


TableB.lA threshold loss function. 


replacing these constraints by ones designed to operate in the context of 
statistical uncertainty. 

Example B.8 To show the quite surprising influence of thresholds on op¬ 
timal allocation numbers, we consider a greatly simplified, mathematically 
tractable example. Let 0 be the true poverty rate for a single area (e.g., a 
county), a be the amount allocated to the area per child in the population 
base, and Y denote all data. For a threshold T, consider the societal loss 
function in Table B.8. The optimal per-child allocation value is then 

ar(Y) = E(9 | 8 > T, Y) x pr(0 > T | Y) . 

Though the allocation formula as a function of the true poverty rate (0) 
has a threshold, the optimal allocation value has no threshold. Furthermore, 
for T > 0, ax(Y) is not the center of the the posterior distribution (neither 
the mean, median, nor mode). It is computed by multiplying the posterior 
mean (conditional on 9 > T) by a posterior tail area. This computation 
always produces an allocation value smaller than the posterior mean; for 
example, if the posterior distribution is exponential with mean /./, qt — 
n(l + r)e~ r where r — Tjpi. But this reduction is compensated by the lack 
of a threshold in ay ■ It is hard to imagine coming up with an 
procedure in this setting without Bayesian structuring. 

Actual societal loss should include multiplication by population size, in¬ 
corporate a dollar cap on the total allocation over all candidate areas, and 
"keep score" over multiple years. Absolute error should replace squared- 
error in comparing a to 9. ■ 

B.4 Multiplicity 

There are several opportunities for taking advantage of multiple analyses, 
including controlling error rates, multiple outcomes, multiple treatments, 
interim analyses, repeated outcomes, subgroups, and multiple studies. Each 
of these resides in a domain for control of statistical properties, including 
control for a single analysis, a set of endpoints, a single study, a collection 
of studies, or even an entire research career! It is commonly thought that 
the Bayesian approach pays no attention to multiplicity. For example, as 
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a 

Za/2 


s /—2 log a 

.01 

2.57 

2.81 

3.03 

.02 

2.33 

2.57 

2.80 

.05 

1.96 

2.24 

2.45 

.10 

1.65 

1.95 

2.15 


Table B.2 Comparison of upper univariate confindence limits, multiple compar¬ 
isons setting with K = 2. 


shown in Example 1.2, the posterior distribution does not depend on a 
data monitoring plan or other similar aspects of an experimental design. 
However, a properly constructed prior distribution and loss function can 
acknowledge and control for multiplicity. 

Example B.9 When a single parameter is of interest in a K-v ariate pa¬ 
rameter structure, the Bayesian formulation automaticallly acknowledges 
multiplicity. For the K = 2 case, consider producing a confidence interval 
for 0i when: 8 — , # 2 ) ~ iV^O, 1-z). The (1 - a HPD region is 

||0|| 2 < — 21og(a) . 

Projecting this region to the first coordinate gives 

- \J—2 log(a) < 0\ < s/-2 log(a) . 

Table B.2 shows that for K = 2 the Bayesian HPD interval is longer than 
either the unadjusted or the Bonferroni adjusted intervals (is the Bon- 
ferroni non-coverage probability). That is, the Bayesian interval is actually 
more conservative! However, as K increases, the Bayesian HPD interval 
becomes narrower than the Bonferroni interval. ■ 


B.5 Multiple testing 

B.5.1 Additive loss 

Under additive, component-specific loss, Bayesian inference with compo¬ 
nents that are independent a priori separately optimizes inference for each 
component with no accounting for the number of comparisons. However, 
use of a hyperprior Bayes (or empirical Bayes) approach links the compo¬ 
nents, since the posterior "borrows information" among components. The 
&-ratio /-test is a notable example. With F denoting the F-test for a one¬ 
way ANOVA, 

F = (* 2 + tfr 3 )/* a 
(1 ~B) = (F — 1 )/F 
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Jf , {F-1\ 1/2 Y 1 -Y 2 

Z12 = ( — ) 

The magnitude of F adjusts the test statistic. For large K, under the global 
null hypothesis Hq : /’(all Zp = 0) > 0.5, the rejection rate is much smaller 
than 0.5. Note that the procedure depends on the estimated prior mean (pi) 
and shrinkage ( B) and the number of candidate coordinates can influence 
these values. If the candidate coordinates indicate high heterogeneity, then 
B is small as is the Bayesian advantage. Therefore, it is important to con¬ 
sider what components to include in an analysis to control heterogeneity 
while retaining the opportunity for discovery. 


B.5.2 Non-additive loss 

If one is concerned about multiplicity, the loss function should reflect this 
concern. Consider a testing problem with a loss function that penalizes 
for individual coordinate errors and adds an extra penalty for making two 
errors: 


parameters: #i ,$2 G {0,1} 

decisions: o,x,a 2 E {0, 1} 

/(a, 0) = di(1 — 0i)£ o + (1 — 

+ a 2 (l — 9 2 )£o + (1 — a 2 )9 2 ti 
+ 7(1 - 0i)(l - 9 2 )a 1 a 2 . 

With to = = 1 and Gp — P(0\ = i, 0 2 — j data), a straightforward 

computation shows that the risk is 

a t(l — 2Gi+) + a 2 ( 1 — 2G+i) + 7 aia 2 Goo + {Gi+ + G+i} . 

The four possible values of (m, a 2 ) then produce the risks in Table B.3, 
and the the Bayes decision rule (the (a ls a 2 ) that minimize risk) is given in 
Table B.4. Note that in order to declare the second component “a 2 = l,’ 1 
we require more compelling evidence than if y = 0. 


ai 

a 2 

risk — {Gi+ + G+i} 

0 

0 

0 

1 

0 

1 — 2Gi + 

0 

1 

1 - 2G + i 

1 

1 

(1 — 2Gi + ) + (1 — 2Gi+) + jGqo 


Table B.3 Risks for the possible decision rules. 
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condition 

optimal rule 



G 1+ < .5,G+i < .5 

ai = 0, a 2 = 0 



Gi + < .5, G .|-i > .5 

at = 0, a 2 = 1 



Gi+ > .5, G+i < .5 

ai = 1, a 2 = 0 



G\+ > G+i > .5 

i / °. 

ai — 1) «2 — | ^ 

if (2G+i 
if (2G+i 

- 1 ) < 7 G 00 

- 1 ) > 7 G 00 


Table B.4 Optimal decision rule for non-additive loss. 

Discussion 

Statistical decision rules can be generated by any philosophy under any 
set of assumptions. They can then be evaluated by any criteria - even 
those arising from an utterly different philosophy. We contend (and much 
of this book shows) that the Bayesian approach is an excellent "procedure 
generator," even if one's evaluation criteria are frequentist. This somewhat 
agnostic view considers the parameters of the prior, or perhaps the entire 
prior, as "tuning parameters" that can be used to produce a decision rule 
with broad validity. Of course, no approach automatically produces broadly 
valid inferences, even in the context of the Bayesian models. A procedure 
generated assuming a cocksure prior that turns out to be far from the truth 
will perform poorly in both the Bayesian and frequentist senses. 


B.6 Exercises 


1. Show that the Bayes estimate of a parameter 9 under weighted squared 
error loss, /(0, a ) = w(9)(9 — a) 2 , is given by 


dir(x) 


£[0te(0)|x] 


2. Suppose that the posterior distribution of 9, p(0\x), is discrete with 
support points \0 l ,# 2 , • ■ Show that the Bayes rule under 0-1 loss 
(B.2) is the posterior mode. 

3. Consider the following loss function: 

l(9a) = ( P\ e ~ a l d > a 
' \ (1 — p)\9 — aj, 9 < a 

Show that the Bayes estimate is the 100 x p ih percentile of the posterior 
distribution. 
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(Note: The posterior median is the Bayes estimate for p = .5. This 
loss function does have practical application for other values of p. In 
many organizations, employees are allowed to set aside pre-tax dollars 
for health care expenses. Federal rules require that one cannot get back 
funds not used in a 12-month period, so one should not put away too 
much. However, putting away too little forces some health care expenses 
to come from post-tax dollars. If losses are linear in dollars (or other 
currency), then this loss function applies with p equal to the individual's 
marginal tax rate.) 


4. 


For the binomial distribution with n trials, find the minimax rule for 
squared error loss and normalized squared error loss, L(6, a) = jpppfy 


(Hint: The rules are Bayes rules or limits of Bayes rules.) 

For each of these estimates, compute the frequentist risk (the expected 
loss with respect to f(x\6)), and plot this risk as a function of 6 (the 
risk plot). Use these risk plots to choose and justify an estimator when 
6 is 


(a) the head probability of a randomly selected coin; 

(b) the failure probability for high-quality computer chips; 

(c) the probability that there is life on other planets. 

5. For the exponential distribution based on a sample of size n, find the 
minimax rule for squared error loss. 

6. Find the frequentist risk under squared error loss (i.e., the MSE) for the 
estimator d a c (x) = a + cX . What true values of 9 favor choosing a small 
value of c (high degree of shrinkage)? Does your answer change if n is 
large? 

7. Let X be a random variable with mean ;t and variance <x 2 . You want to 
estimate //, under SEL, and propose an estimate of the form (1 - b)X. 


(a) Find b*, the b that minimizes the MSE. 

(Hint: Use the "variance + (bias) 2 ” representation of MSE.) 

(b) Discuss the dependence of b * oil// and a 2 and its implications on the 
role of shrinkage in estimation. 


8. Consider a linear regression having true model of the form Y; = a + 
f3xi + 6,. where the e, are i.i.d. with mean 0 and variance a 2 . Suppose 
you have a sample of size n, use least-squares estimates (LSEs), and 
want to minimize the prediction error: 

PE = E[Y new - Y(X new f] . 

Here, ~Y{x) is the prediction based on the estimated regression equation 
and X new is randomly selected from a distribution with mass ~ at each 
of the x\,.. . , x n . 
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(a) Assume that J2i x i — 0 and show that even though you may know 

that 8 ^ 0, if x\ is sufficiently small, it is better to force /? = 0 
than to use the LSE for it. 

(Note: This result underlies the use of the C p statistic and PRESS 
residuals in selecting regression variables; see the documentation for 
SAS Proc REG.) 

(b) (More difficult) As in the previous exercise, assume you plan to use 
a prediction of the form Y(x) = d + (1 — b)j3x. Find b*, the optimal 
value of b. 

9. In an errors-in-variables simple regression model, the least squares es¬ 
timate of the regression slope (/?) is biased toward 0, an example of 
attenuation. Specifically, if the true regression (through the origin) is 
Y — x/3 + e, but Y is regressed on X, with X = x + $, then the least 

<7 2 

squares estimate (/?) has expectation: E[j8\ ss p[3, with p = - 2 < 1. 

If p is known or well-estimated, one can correct for attenuation and 
produce an unbiased estimate by using /?/p to estimate /?. However, this 
estimate may have poor MSE properties. To minimize MSE, consider an 
estimator of the form: = c(8/p). 

(a) Let <t 2 = Var[8\ and find the c that minimizes MSE. 

(b) Discuss the solution's implications on the variance/bias tradeoff and 
the role of shrinkage in estimation. 

10. Compute the HPD confidence interval in Example B.9 for a general 
K, and compare its length to the unadjusted and Bonferrom adjusted 
intervals. 
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APPENDIX C 


Software guide 


Many Bayesians argue that the continuing emphasis on classical rather 
than Bayesian methods in actual practice is less a function of philosoph¬ 
ical disagreement than of the lack of reliable, widely-available software. 
While the largest commercial statistical software packages by and large do 
not provide Bayesian procedures, a large number of hierarchical modeling 
and empirical Bayes programs are commercially available. Moreoever, the 
number of fully Bayesian packages continues to burgeon, with many avail¬ 
able at little or no cost. This appendix provides a brief guide to many of 
these software packages, with information on how to obtain some of the 
noncommercial products. 

We caution the reader at the outset that we make no claim to provide 
full coverage, and that guides like this one go out of date very quickly. Few 
of the packages listed in the exhaustive review of Bayesian software by Goel 
(1988) remain in common use today. Packages evolve over time; some are 
expanded and combined with other packages, while others disappear alto¬ 
gether. Developers' addresses sometimes change, or primary responsibility 
for a package passes to a new person or group (e.g., when a successful non¬ 
commercial package makes the jump to commercial status, as in the case 
of S and S-plus ). 

Many of the programs listed below are freely and easily available elec¬ 
tronically via the World Wide Web (WWW, or simply "the web"). Many 
of these are contained in StatLib, a system for the electronic distribution of 
software, datasets, and other information. Supported by the Department 
of Statistics at Carnegie Mellon University, StatLib can be accessed over 
the web at 


http://lib.stat.emu.edu/ 

The reader may also wish to consult the first author's webpage, 
http://www.biostat.umn.edu/~brad 

for more recently updated information on available software, as well as 
Bayes and EB inference in general. Finally, the recent overview by Berger 
(2000) includes an appendix with additional Bayesian software websites. 
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C.l Prior elicitation 

While computation of posterior and predictive summaries is the main com¬ 
puting task associated with applied Bayes or empirical Bayes analysis, it is 
certainly not the first. Prior distributions for the model parameters must 
be determined, and in many cases this involves the elicitation of prior opin¬ 
ion from decision makers or other subject-matter experts. Subsection 2.2.1 
showed that this can be an involved process, and a number of computer 
programs for elicitation have been described in the literature. 

Linear regression models 

Perhaps the first such program of this type was reported by Kadane et al. 
(1980). These authors developed an interactive program for eliciting the 
hyperparameters of a conjugate prior distribution for the standard regres¬ 
sion model (i.e., one having linear mean structure and normally distributed 
errors). Implemented on an experimental computer system called TROLL 
at the Massachusetts Institute of Technology, the program and its platform 
look fairly primitive by today's standards. However, the work showed that 
such programs were feasible, and contributed the important idea of eliciting 
prior opinion on potentially observable quantities (such as predicted values 
of the response variable) instead of the regression parameters themselves, 
about which the elicitee may have little direct feeling. 

An updated Fortran version of the above approach is described in Kadane 
and Wolfson (1996), and freely available from SiaiLib at 

http: //lib.stat.emu.edu/general/elicit-normlin/ 

Kadane, Chan, and Wolfson (1996) describe an extension of these methods 
to the case of AR(1) error structure. See also 

http://lib.stat.emu.edu/general/elicit-diric/ 

for similar code for eliciting a prior on the concentration parameter in a 
Dirichlet process prior. Finally, elicitation programs for several Bayesian 
models are available on Prof. Wolfson's software page, 

http: //statweb.byu.edu/1jw/elicitation/ 

Use of dynamic graphics 

Dr. Russell Almond developed a program called ElToY for eliciting Bayesian 
conjugate prior distributions through constraint-based dynamic graphics. 
Written in XLISP-STAT, the program does one parameter elicitation for 
univariate conjugate priors through dynamic graphics. It contains a tool for 
interactively displaying a univariate distribution, as well as a demonstration 
of the central limit theorem. Copies are available over the web at 

http://lib.stat.emu.edu/xlispstat/eltoy/ 
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C.2 Random effects models/ Empirical Bayes analysis 
SAS Proc MIXED 

Proc MIXED is a SAS procedure for analyzing linear models with both fixed 
and random effects. As we have seen, these models are particularly useful 
for the analysis of repeated measures data from clinical trials and cohort 
studies. The basic form dealt with is exactly that of equation (7.10), namely, 

Yi = X,-a + W ift + Ci , i = 1,..., n, 

where a is the vector of fixed effects and the are vectors of subject-specific 
random effects, assumed to be independent and normally distributed with 
mean vector 0 and covariance matrix The are also assumed indepen¬ 
dent (of each other and of the and identically distributed as normals 

with mean 0 and covariance matrix Thus, marginally we have that 

Var(Yi) = W 8 VW'. + £ , 

so that even if £ is taken as <j?I (as is customary), the components of Yi 
are still marginally correlated. 

Proc MIXED essentially works with the marginal model above, obtain¬ 
ing point estimates of the model parameters by maximum likelihood, im¬ 
plemented via the Newton-Raphson algorithm. Individual level point and 
interval estimates for the parameters /?,• and predicted values Y t - are also 
available. While this approach is essentially classical, if one thinks of the 
random effects distribution as a prior, it is the same as empirical Bayes. 
Proc MIXED addresses the problem of undercoverage of "naive" individual 
level confidence intervals (see Section 3.5) by using t-distributions with 
appropriately selected degrees of freedom. 

Several forms for the covariance matrices V and £ are available, rang¬ 
ing from a constant times the identity matrix (conditional independence) to 
completely unstructured. Popular intermediate choices include compound 
symmetry (constant correlation; equivalent to a random intercept) and var¬ 
ious time-series structures where correlation depends on the temporal or 
spatial distance between observations. The package provides the data ana¬ 
lyst with a rich array of modeling options for the mean vector, the random 
effects and the "time series" structures. 

Another SAS procedure, NLMIXED, fits nonlinear mixed models; that is, 
models in which both fixed and random effects enter nonlinearly. These 
models have a wide variety of applications, two of the most common being 
pharmacokinetics and overdispersed binomial data. Proc NLMIXED enables 
you to specify a conditional distribution for your data (given the random 
effects) having either a standard form (normal, binomial, Poisson) or a 
general distribution that you code using SAS programming statements. 

Proc NLMIXED fits nonlinear mixed models by maximizing an approxima¬ 
tion to the likelihood integrated over the random effects. Different integral 
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approximations are available, the principal ones being adaptive Gaussian 
quadrature and a first-order Taylor series approximation. A variety of alter¬ 
native optimization techniques are available to carry out the maximization; 
the default is a dual quasi-Newton algorithm. 

Proc NLMIXED is currently limited to only two-level models, and is also 
limited by its quadrature approach. However, two SAS macros are available 
to accommodate a much wider class of models. The first, GLIMMIX, allows 
fairly arbitrary generalized linear models, including repeated measures for 
binary or count data, and Poisson or logistic regression with random effects. 
It does this by iteratively applying Proc MIXED to a linear approximation 
to the model (see Wolfinger and O'Connell, 1993). The second, NLINMIX, 
is specifically set up to handle nonlinear random coefficient models. This 
macro works by iteratively calling Proc MIXED following an initial call to 
Proc NLIN. Wolfinger (1993) shows that this is mathematically equivalent 
to the second-order approximate method of Lindstrom and Bates (1990), 
which expands the nonlinear function about random effects parameters set 
equal to their current empirical best linear unbiased predictor. The macros 
are available at 

http://www.sas.com/techsup/download/stat/ 

in the files beginning with "glmm" and "nlmm" prefixes. 

Finally, we point out one fully Bayesian capability of Proc MIXED, first 
documented in SAS Release 6.11. Independent pseudo-random draws from 
the joint posterior distribution of the model parameters are available in the 
case of what are often referred to as variance components models. Searle, 
Casella, and McCulloch (1992) provide a comprehensive review of these 
models; an outstanding Bayesian treatment is given in Section 5.2 of Box 
and Tiao (1973). More specifically, Wolfinger and Rosner (1996) illustrate 
this capability for the hierarchical model 

^ijki = p + ati + /3j + 7 Xijki + Uijk + tijki , 

where the Uijk and tijki are both independent, mean-zero normal errors 
with variances cr 2 and a 2 . respectively. Using the PRIOR statement to spec¬ 
ify the prior on these two parameters (flat or Jeffreys), Proc MIXED then 
specifies flat priors for the remaining (fixed) effects, and integrates them 
out of the model in closed form. After improving the symmetry of the 
remaining bivariate posterior surface via linear transformation, the pro¬ 
gram generates posterior values for a 'i and a 2 e via an independence chain 
algorithm which uses a product of inverse gamma densities as its base 
sampling function. Since the conditional posterior of the remaining param¬ 
eters given a 2 and <j 2 is multivariate normal, the full posterior sample is 
easily completed. This sample can be output to a SAS dataset, and subse¬ 
quently analyzed using Proc UNIVARIATE, Proc CAPABILITY, Proc KDE2D 
(for two-dimensional kernel density estimates and contour plots), or the 
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exploratory data analytic SAS/INSIGHT software. For more details and 
examples, refer to Wolfinger and Kass (2000). General information and 
updates about the preceding SAS software is available at the company's 
webpage at 

http: //www.sas.com/rnd/app/Home.html 

Other commercial packages 

Several other software packages for fitting random effects and other mixed 
models are available commercially. Kreft et al. (1994) provide a review of 
five of these packages: BMDP-5V, GENMOD, HLM, ML3, and VARCL . These au¬ 
thors comment specifically on how these five packages compare with respect 
to ease of use, documentation, error handling, execution speed, and output 
accuracy and readability. Zhou et al. (1999) offer a similar comparative 
review of four packages that can handle discrete response data (general- 
ized linear mixed models): SASProc MlXED(via the old Gliiranixmacro), 
MLwiN, HLM, and VARCL. Here, we merely sketch some of the features of 
these packages. 

Statistical and biostatistical readers will likely be most familiar with 
BMDP-5V, a procedure in the BMDP biomedical data analysis package ini¬ 
tiated in the 1960s at the University of California. Like Proc MIXED, 5V 
is intended for the analysis of repeated measures data, and contains many 
of the same modeling features, such as a generous collection of options for 
the forms of the variance matrices V and £. It also allows several unusual 
designs, such as unbalanced or partially missing data (imputed via the EM 
algorithm), and time-varying covariates (see Schluchter, 1988). However, 
at present it has no capability for fully Bayesian posterior analysis. 

The remaining packages are targeted to specific groups of practitioners. 
GENMOD is popular among demographers, the group for whom it was orig¬ 
inally developed. HLM is widely used by social scientists and educational 
researchers in the U.S. (where it was developed), while ML3 is a similar 
program more popular in the U.K. (where it was developed by Prof. Har¬ 
vey Goldstein and colleagues; c.f. Goldstein, 1995). VARCL, also developed 
in the U.K., is also used by educational researchers. These PC-based pro¬ 
grams tend to be smaller and less expensive than SAS or BMDP. 

ML3 has been replaced by MLn, which expands the program's modeling 
capability from three to an arbitrary number of levels, and by MLwiN, the 
Windows successor to MLn. The program is distributed by the Multilevel 
Models Project at the Institute of Education, University of London. Their 
web homepage can be accessed at 

http://multilevel.ioe.ac. uk/ 

MLwiN now offers the capability for fully Bayesian analysis in a limited 
class of models under diffuse priors using MCMC methods. Specifically, 
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Gibbs sampling is used with normal responses, and Metropolis-Hastings 
sampling for normal and binary proportion responses; capability for general 
binomial and Poisson response models will be forthcoming shortly. Sample 
trace plots, kernel density estimates, and several convergence diagnostics 
are also implemented. For information, see 

http: //www.ioe.ac.uk/mlwin/mcmc.html/ 

MLwiN, HLM, and VARCL are also distributed by ProGamma, an interuni¬ 
versity expertise center which develops and distributes a very wide range of 
computer applications for the social and behavioral sciences. Information 
on obtaining these and many other popular programs is available at 

http: / /www. gamma. rug. ni/ 

Noncommercial packages 

Noncommercial packages (freeware available from their developers or dis¬ 
tributed over computer networks) continue to be developed to fit hierarchi¬ 
cal and empirical Bayes models. In the case of likelihood analysis, one of 
the best documented and easiest to use alternatives is nlme, a collection of 
S functions, methods, and classes for the analysis of linear and nonlinear 
mixed-effects models written by Drs. Douglas Bates, Jose Pinheiro, and 
Mary Lindstrom. See 

http: //nlme. stat. wise. edu/ 

for a full description. Both the S and the C source code are freely available 
from this website. The commercial vendor MathSoft does ship this code 
with various versions of S-plus, but the most up-to-date code is always 
that from the above web site. For more on S-plus itself, see 

http://www.insightful.com/products/default.asp 

and the book by Pinheiro and Bates (2000) for the connection to nlme. 

A similar approach is provided by the MIXOR and MIXREG programs by 
Drs. Donald Hedeker and Robert D. Gibbons. These programs include 
mixed-effects linear regression, mixed-effects logistic regression for nominal 
or ordinal outcomes, mixed-effects probit regression for ordinal outcomes, 
mixed-effects Poisson regression, and mixed-effects grouped-time survival 
analysis. For more information, see 

http: //www.uic.edu/~hedeker/mix.html 

Perhaps the most noteworthy current program for empirical Bayes anal¬ 
ysis is PRIMM, an acronym for Poisson Regression Interactive Multilevel 
Model. PRIMM is an S-plus function for analyzing a three-stage Poisson 
regression model under the usual log link function. The program uses a 
relatively vague hyperprior at the third model stage so that the resulting 
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estimates of the regression and individual-specific parameters generally en¬ 
joy good frequentist and Bayesian properties, similar to our perspective in 
Chapter 4. Some aspects of the hyperprior may be input by the user or 
simply set equal to data-based default values, corresponding to a Bayes or 
EB analysis, respectively. In the latter case, an extension of Morris' EB 
interval adjustment method presented in Subsection 3.5.1 is used to avoid 
the undercoverage associated with naive EB confidence intervals. 

PRlMM's computational method combines standard maximization rou¬ 
tines with clever analytic approximations to obtain results much more 
quickly (though somewhat less accurately) than could be achieved via 
MCMC methods. Christiansen and Morris (1997) give a full description of 
the model, prior distributions, computational approach, and several sup¬ 
porting data analyses and simulation studies. The S code is available on 
StatLib at 

http://lib.stat.cmu.edU/S/priiran . 


C.3 Bayesian analysis 

A considerable number of Bayesian software packages have emerged in the 
last ten to twenty years. The emergence rate steadily increases in tandem 
with advances in both computing speed and statistical methodology. Early 
attempts (from the 1970s and 80s) typically handled computer limitations 
by reducing the package's scope, often by limiting the dimension of the 
problem to be solved. Later, some packages began to focus on a particular 
class of models or problems, thus limiting the computational task somewhat 
while simultaneously heightening their appeal among practitioners in these 
areas. Some packages aim specifically at persons just learning Bayesian 
statistics (and perhaps computing as well), sacrificing breadth of coverage 
for clarity of purpose and ease of use. Most recently, a few general purpose 
Bayesian analysis engines have begun to emerge. Such generality essentially 
forces the package to employ MCMC computing methods, which as we 
know from Section 5.4 carry an array of implementational challenges. 

A noteworthy early development in Bayesian computing was sbayes, a 
suite of S functions for approximate Bayesian analysis. Developed by Prof. 
Luke Tierney and recently updated by Prof. Alex Gitelman, sbayes com¬ 
putes approximate posterior moments and univariate marginal posterior 
densities via Laplace's method (second order approximation; see Subsec¬ 
tion 5.2.2). First order (normal) approximation and importance sampling 
are also available as computational methods. The user must input an S 
function to compute the joint log-posterior (up to an additive constant), 
an initial guess for the posterior mode, optional scaling information, and 
the data. The program finds the true joint posterior mode via a modified 
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Newton algorithm, and then proceeds with Laplace's method. The program 
is freely available from StatLib, at 

http : //lib.stat.emu.edu/S/sbayes/ 

Alternatively, an XLISP-STA T version is available as part of the core 
XLISP-STAT system, an extendible environment for statistical comput¬ 
ing and dynamic graphics based on the Lisp language. This system is also 
available from StatLib at 

http://lib.stat.emu.edu/xlispstat/ 

The Bayes functions in XLISP-STAT are documented in Section 2.8.3 of 
Tierney (1990). 

Another recent program worth mentioning is bayespack. Prof. Alan 
Genz's collection of Fortran routines for numerical integration and impor¬ 
tance sampling. Based on methods described in Genz and Kass (1997), this 
collection is available from the website 

http: //www.math.wsu.edu/math/faculty/genz/homepage 


C.3.1 Special purpose programs 
Time series software 

The BATS (Bayesian Analysis of Time Series ) program focuses on the set 
of dynamic linear and nonlinear models, including Kalman filter models. 
Originally written in the APL language, it was developed at the Univer¬ 
sity of Warwick and described in the paper by West, Harrison, and Pole 
(1987). Today, a substantially enhanced modular C version of BATS for Win- 
dows/DOS forms the computing basis for the popular Bayesian forecasting 
and time series book by Pole, West, and Harrison (1994). More information 
can found at 


http: //vrww. isds . duke . edu/~mw/bats . html 

For those interested in non-stationary time series analysis, software and 
documentation for time-varying autoregressive modeling and time series 
decompositon analyses are available at 

http://www.isds.duke.edu/~mw/tvar.html 

This is a preliminary version of the software used for analyses in a series of 
recent articles by Drs. Raquel Prado and Mike West. 

Relatedly, ARCOMPis a set of Fortran programs and S-pIus functions that 
implement MCMC methods for autoregressive component modeling with 
priors on the characteristic roots. The approach involves focus on underly¬ 
ing latent component structure, and handles problems of uncertainty about 
model order and possible unit roots. This software by Drs. Gabriel Huerta 
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and Mike West can be downloaded from 

http: / /www.isds.duke.edu/~mw/armodels.html 

Model selection software 

Several S-plus functions for Bayesian model selection and averaging are 
freely available from StatLib, at 

http: //lib.stat.emu.edu/S/ 

For instance, bicreg, bic. logit and glib, all written by Prof. Adrian 
Raftery of the University of Washington, use the BIC (Schwarz) criterion 
(2.21). The first does this in the context of linear regression models, the 
second for logistic regression models, and the third for generalized lin¬ 
ear models; see Raftery (1996). Dr. Jennifer Floeting offers bma, a pro¬ 
gram for model averaging in linear regression models. Finally, bic.glm and 
bic.surv, both contributed by Dr. Chris Volinsky, implement Bayesian 
model averaging using BIC, the first for generalized linear models and the 
second for Cox proportional hazards models. For more information, see also 

http: //www.research.att.com/~volinsky/bma.html 

Software for clinical trials and other biomedical data 

Spiegelhalter et al. (2000) list several websites concerned with methods and 
software for Bayesian methods in health technology assessment, especially 
clinical trial analysis. These include Prof. Lawrence Joseph's page, 

http: //www.epi.mcgill.ca/~web2/Joseph/softwarp.htm 

which gives a variety of programs for Bayesian sample size calculations, 
change-point methods, and diagnostic testing. 

On the commercial side, there is iBART {instructional Bayesian Anal¬ 
ysis of Randomized Trials), a new package for Windows 95, 98, and NT 
described on the web at 

http: //www.talariainc.com/ 

Loosely based on the original BART program written by Prof. David Spiegel¬ 
halter of the Medical Research Council (MRC) Biostatistics Unit at the 
University of Cambridge, iBART implements the Bayesian approach to clin¬ 
ical trial monitoring and analysis via an indifference zone, as described in 
Section 6.1.2 above. The software allows updating of normal and other 
more specialized survival likelihoods using standard prior distributions. 

Finally, Prof. Giovanni Parmigiani has developed BRCAPRO, a Bayesian 
software program for the genetic counseling of women at high risk for hered¬ 
itary breast and ovarian cancer. Available for PC and UNIX platforms, this 


© 2000 by CRC Press LLC 


program and related information is available at 

http://astor.som.jhmi.edu/brcapro/ 

Belief analysis software 

In the realm of belief analysis, Wooff (1992) describes a computer program 
called [B/D] , an acronym for Beliefs adjusted by Data. Developed at the 
Universities of Hull and Durham, the program implements the subjectivist 
Bayesian theory in which prevision (expectation), rather than probability, 
is the fundamental operation (see e.g. Goldstein, 1981). Given a partial 
specification of beliefs over a network of unknown quantities, the program 
checks them for internal consistency (coherence) and then adjusts each in 
the presence of the others and the observed data. [B/D] is written in Pascal, 
runs in either interactive or batch mode, and contains both progamming 
(e.g., if-then-else) and some statistical package (e.g., belief structure ad¬ 
justment) elements. However, it is primarily a tool for organizing beliefs, 
rather than a statistical package for doing standard analyses (e.g., linear 
regression). The program is freely available at 

http: //fourier . dur . ac . uk : 8000/stats/bd/. 

A related program is the Bayesian Knowledge Discoverer, BKD, a com¬ 
puter program designed to extract Bayesian belief networks from (possibly 
incomplete) databases. Based on the "bound and collapse" algorithm, the 
aim of BKD is to provide a knowledge discovery tool able to extract reusable 
knowledge from databases. Also, the Robust Bayesian Classifier, RoC, is a 
program able to perform supervised Bayesian classification from incom¬ 
plete databases, with no assumption about the pattern of missing data. 
Both programs are available at the Open University (U.K.) website 

http: //kmi.open.ac.uk/projects/bkd/ 

Expert systems software 

An important recent application of Bayesian methodology is to the field 
of expert systems, since automatic updating of the information available to 
a decision-maker given the current collection of data is most simply and 
naturally done via Bayes' Rule. Before describing the software packages 
available for this type of analysis, we must first briefly describe the modeling 
framework in which they operate. 

The probability and conditional independence structure of the variables 
within an expert system is often represented pictorially in a graphical model. 
In such a graph, nodes in the graph represent variables while directed edges 
between two nodes represent direct influence of one variable on the other. 
As a result, missing links represent irrelevance between nodes. To avoid 
inconsistencies, we do not allow a sequence of directed edges in the graph 
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that returns to its starting node (a cycle), so our graph is a directed acyclic 
graph, or DAG. Note that such a graph may be constructed in a given 
problem setting without reference to probability, since it requires only in¬ 
tuition about the causal structure in the problem. As such, DAGs are often 
easily specified by practitioners, and helpful in organizing thinking about 
the problem before getting bogged down in quantitative details. Spiegel- 
halter et al. (1993) provide a thorough review of developments in the use 
of graphical models and Bayesian updating in expert systems. 

To add the probability structure to the graphical model, we write V — 
{v} as the collection of nodes in the graph. Then redefining "irrelevance" 
as "conditional independence," we have that 

P{V) = n p(v|pa(t>)) , (C l) 

where pa(v) denotes the parent nodes (direct antecedents) of node v. Thus, 
the probability structure of the graph is determined by "prior" distributions 
for the nodes having no parents, and the conditional distributions for the 
remaining "children" given their parents. The marginal distribution of any 
node given any observed "data" nodes could then be found via Bayes' Rule. 

In the case where all of the v are given discrete distributions, the Bayesian 
updating can be carried out exactly using the algorithm of Lauritzen and 
Spiegelhalter (1988). This algorithm is perhaps best described as a "two- 
pass" algorithm, as the node tree is traversed in both directions when "col¬ 
lecting' and "distributing" evidence. Perhaps the most professional piece 
of software implementing this algorithm is Hugin, a Danish commercial 
product distributed by Hugin Expert A/S. (Incidentally and somewhat re¬ 
freshingly, Hugin is not an acronym, but rather the name of one of two 
ravens in Scandinavian mythology who brought back news from every cor¬ 
ner of the world to the god Odin.) The newest versions of the program allow 
influence diagrams (a graphical probability model augmented with decision 
and utilities) and chain graphs (a graphical model with undirected, as well 
as directed, edges); future enhancements will include parametric learning 
via the EM algorithm. Further information on obtaining Hugin can be ob¬ 
tained from the company's homepage, 

http: //hugin. dk/ 

Freeware for evaluation and updating of expert systems models is also 
available from several sources. A remarkably comprehensive list of the com¬ 
mercial and noncommercial software for processing probabilistic graphical 
models (as well as more general kinds of graphical belief function models) 
is provided by Dr. Russell Almond at the webpage 

http://bayes.stat.Washington.edu/almond/belief.html/ 

Finally, the members of the Decision Theory and Adaptive Systems 
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Group at Microsoft Research have developed a Bayesian belief network 
construction and evaluation tool called MSBN . It supports the creation and 
manipulation of Bayesian networks as well as their evaluation. Complete 
information about MSBN and how to obtain it can be downloaded from the 
website 

http: //fwww.research.microsoft.com/dtas/msbn/default.htm 

The program and its component parts are available free of charge for aca¬ 
demic and commercial research and education following the completion of 
an on-line licensing agreement. 

Neural networks software 

Neural networks have a close relation to expert systems. Algorithms for 
Bayesian learning in these models using MCMC methods were described 
by Neal (1996b). Software implementing these algorithms is freely available 
over the web at 

http://www.cs.toronto.edu/~radford/fbm.sofware.html 

The software supports models for regression and classification problems 
based on neural networks and Gaussian processes, and Bayesian mixture 
models. It also supports a variety of Markov chain sampling methods. 

C.3.2 Teaching programs 

First Bayes 

Written by Prof. Tony O'Hagan, First Bayes is a user-friendly package 
to help students to learn and understand elementary Bayesian statistics. 
Particular features include 

• Plotting and summarization of thirteen standard distributional families, 
and mixtures of these families; 

• Analysis of the one-way ANOVA and simple linear regression models, 
with marginal distributions for arbitrary linear combinations of the lo¬ 
cation parameters; 

• Automatic updating of posterior and predictive summaries following 
user-made changes to the data or prior distribution; 

• Sequential analysis, via a facility for setting the current posterior equal 
to the prior. 

The program is written in APL, a language to which the author displays 
the extreme devotion typical of APL users. It runs under Windows 3.1, 
95, and NT, and may be freely used and copied for any noncommercial 
purposes. It may be obtained over the web directly from the author at 

http: //www.shef.ac.uk/~stlao/lb.html/ 
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Minicab macros 

A recent book by Albert (1996) describes a set of macros written in Minitab, 
an easy-to-use statistical package widely used in U.S. elementary statis¬ 
tics courses. The macros are designed to perform all of the computations 
in Berry (1996), an elementary statistics text that emphasizes Bayesian 
methods. Included are programs to 

• Simulate games of chance; 

• Perform inference for one- and two-sample proportion (binomial) prob¬ 

lems; 

• Perform inference for one- and two-sample mean (normal) problems; 

• Perform inference for the simple linear regression model; 

• Illustrate Bayes tests and predictive inference; 

• Summarize arbitrary one- and two-parameter posterior distributions us¬ 

ing a variety of integration strategies, including sampling-importance 
resampling (weighted bootstrap), Laplace's method, adaptive quadra¬ 
ture, and the Gibbs and Metropolis samplers. 

Information is available with the two textbooks mentioned above, or 
directly from the author via the website 

http: //bayes .bgsu. edu/~albert/mini_bayes/info. html . 


C.3.3 Markov chain Monte Carlo programs 
Tools for convergence diagnosis 

Before describing the packages available for MCMC sampling-based Bayes¬ 
ian analysis, we provide information on obtaining the programs mentioned 
near the end of Chapter 5 for assessing whether the output from an MCMC 
sampler suggests convergence of the underlying process to stationarity. 
These programs generate no samples of their own, but merely operate on 
the output of a user-provided MCMC algorithm. 

First, the S program created by Andrew Gelman to implement the scale 
reduction factor (5.46) advocated as a convergence diagnostic by Gelman 
and Rubin (1992) is available over the web from StatLib at 

http: //lib. stat. emu. edu/S/itsim . 

The code computes both the mean and upper 95 th percentile of the scale 
reduction factor, which must approach 1 as n — oc. 

Second, Steven Lewis has written an implementation of the Raftery and 
Lewis (1992) diagnostic procedure, which attempts to find the number of 
samples from a single MCMC chain needed to estimate some posterior 
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quantile to within some accuracy with some high probability. A Fortran 
version of this code is available over the web from StatLib at 

http://lib.stat.emu.edu/general/gibbsit , 

while an S translation is located in 

http://lib.stat.emu.edu/S/gibbsit 

Both of the above S programs, as well as several other convergence diag¬ 
nostics, are also implemented in the coda, boa, and winBUGS packages; see 
below. 

BUGS, WinBUGS, CODA, and BOA 

The easy accessibility and broad applicability of the sampling based ap¬ 
proach to Bayesian inference suggests the feasibility of a general-purpose 
language for Bayesian analysis. Until recently, most of the MCMC com¬ 
puter code written was application-specific: the user simply wrote a short 
piece of Gibbs or Metropolis sampling code for the problem at hand, and 
modified it to fit whatever subsequent problems came along. Practition¬ 
ers have since recognized that certain high-level languages (such as S and 
XLISP-STAT) are quite convenient for the data entry, graphical conver¬ 
gence monitoring, and posterior summary aspects of the analysis, while 
lower-level, compiled languages (such as Fortran or C) are often needed to 
facilitate the enormous amount of random generation and associated loop¬ 
ing in their sampling algorithms. Of the two tasks, the lower-level portion 
is clearly the more difficult one, since the computer must be taught how to 
understand a statistical model, organize the prior and likelihood compo¬ 
nents, and build the necessary sampling distributions before sampling can 
actually begin. 

In this vein, the most general language developed thus far seems to be 
bugs, a clever near-acronym for Bayesian inference Using Gibbs Sampling. 
Developed at the MRC Biostatistics Unit at the University of Cambridge 
and initially described by Gilks, Thomas, and Spiegelhalter (1992), BUGS 
uses 5-like syntax for specifying hierarchical models. The program then 
determines the complete conditional distributions necessary for the Gibbs 
algorithm by converting this syntax into a directed acyclic graph (see Sub- 

eters in the model. BUGS successively samples from the parameter nodes, 
writing the output to a file for subsequent convergence assessment and 
posterior summarization. 

To see how BUGS determines the appropriate sampling chain, note that 
the full conditional distribution for each node can be inferred from the links 
in the graph and the conditional distribution at each node. That is, in the 
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notation of equation (C.l), 

p(v | V\v) <x p(v, V^\v) = p(V) , 

where V\v (read "V remove v") is the collection of all variables except for 
v. This distribution is proportional to the product of the terms in p(V) 
containing v, i.e., 

oc p(T|pa(r>)) p(w|pa(u>)) . 

uGpa(w) 

The first term on the right-hand side is the distribution of v given its 
parents, while the second is the product of distributions in which v is 
itself a parent. Thus, cycling through the node tree in the usual Gibbs 
updating fashion (see Subsection 5.4.2) produces a sample from the full 
joint distribution of V at convergence. 

Recall that if all of the variables v can assume only a finite number of 
values, then exact, noniterative algorithms based on the work of Lauritzen 
and Spiegelhalter (1988) are available for computing the posterior distribu¬ 
tion of any node in the graph. If some of the v are continuous, however, no 
such algorithm is available, but the preceding discussion shows the problem 
is still natural for Gibbs sampling. BUGS thus affords much more general 
graphical models than, say, Hugin, but also has the disadvantage of being 
both iterative and approximate, due to its MCMC nature. 

BUGS is able to recognize a number of conjugate specifications (e.g., nor¬ 
mal/normal, Poisson/gamma, etc.), so that the corresponding nodes can 
be updated quickly via direct sampling. Adaptive rejection sampling (Gilks 
and Wild, 1992) is used for other nodes having log-concave full condi¬ 
tionals; the remaining nodes can be updated via Metropolis sampling, or 
else discretized and their standardized full conditionals computed by brute 
force (as would be done for any discrete-finite node). Other helpful fea¬ 
tures include a variety of distributions (including the multivariate normal 
and Wishart), functions (including the log-gamma and matrix inverse), in¬ 
put capabilities (to facilitate multiple sampling chains or multiple priors), 
and advice on implementing a variety of Bayesian data analysis tools, such 
as reparametrization, model criticism, and model selection. An extensive 
examples manual is also available. 

Recently, development of the BUGS language has focused on its Windows 
version, WinBUGS. While "classic" BUGS uses only text representations of 
models and commands to drive the simulation, WinBUGS can also use a 
graphical interface called DoodleBUGS to construct the model, and a point- 
and-click Windows interface for controlling the analysis. There are also 
graphical tools for monitoring convergence of the simulation. The model 
specification and simulation are separate components: text versions of the 
model can be output from DoodleBUGS, and text model descriptions can 
also be used as input to the simulation program. WinBUGS also includes sev- 
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eral features not part of classic BUGS, including a more general Metropolis 
sampler, on-line graphical monitoring tools, space-efficient monitoring of 
running means and standard deviations, some automatic residual plots for 
repeated measures data, and capability for multiple sampling chains and 
Brooks-Gelman-Rubin diagnostics (see Brooks and Gelman, 1998). All the 
documentation specific to WinBUGS is packaged with the program. 

The most recent extensions to WinBUGS are PKBUGS, a version specifically 
tuned for pharmacokinetic (complex nonlinear) modeling, and GeoBUGS, 
which allows the the fitting and displaying of spatial models. GeoBUGS will 
ultimately include an importer for maps in Arc View, S-plus, and EpiMap 
format. 

A more extensive "back end" for convergence diagnosis or posterior sum¬ 
marization is provided by CODA (a musical analogy, and also an anagram of 
Convergence Diagnosis and Output Analysis), coda is a menu-driven suite 
of S -plus functions for investigating MCMC output in general, and BUGS 
output in particular. The program enables 

• Various summary statistics, such as means, quantiles, naive and batched 
standard errors, and Geweke (1992) "numerical standard errors;" 

• sample traces and density plots; 

• sample autocorrelations (within a chain for one parameter) and cross¬ 
correlations (across one or more chains for two parameters); 

• Several other popular statistics and plots for convergence diagnosis (see 
Subsection 5.4.6), including those of Gelman and Rubin (1992), Geweke 
(1992), Raftery and Lewis (1992), and Heidelberger and Welch (1983). 

Originally written by Prof. Kate Cowles while still a graduate student 
in the Division of Biostatistics at the University of Minnesota, CODA source 
code is now maintained and distributed along with BUGS and WinBUGS. 
CODA allows easy analysis of multiple sampling chains, changing of default 
settings for plots and analysis, and retaining of text and graphical output 
for subsequent display (indeed, several of the plots in this book were created 
in CODA) . The program runs on the same platforms as BUGS and WinBUGS, 
and source code is available. 

BUGS and CODA each come with comprehensive, easy-to-read user guides 
(Spiegelhalter et al., 1995a, and Best, Cowles, and Vines, 1995, respec¬ 
tively). A booklet of worked BUGS examples is also available (Spiegelhalter 
et al., 1995b). The programs and manuals are available over the web at 

http : //www.mrc-bsu.cam.ac.uk/bugs/ . 

In the future, the programs will be shareware, meaning that a nominal fee 
will be requested of those users requiring support. 

Finally, Brian Smith of the University of Iowa has written an S-plus/R 
revision of CODA called BOA {Bayesicm Output Analysis) for carrying out 
convergence diagnostics and statistical and graphical analysis of Monte 
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Carlo sampling output. BOA can be used as an output processor for BUGS, 
WinBUGS, or any other program which produces sampling output. It is quite 
stable, easy to use, feature rich, and freely available over the web at 

http: //www.public-health.uiowa.edu/boa/ 


bpois 

Doss and Narasimhan (1994) describe a Gibbs sampling program for Bayes¬ 
ian Poisson regression called bpois. The program combines Gibbs sampling 
with a graphical interface in the LISP-STA T programming language. It 
also enables sensitivity analysis via importance sampling with dynamic 
graphics. The program is fully menu-driven, and features dialog boxes with 
sliding control mechanisms to change the relevant hyperparameter values. 
The code is specific to Poisson likelihoods, but could be customized by the 
user willing to write the necessary C subroutines. The program is freely 
available from StatLib via the web at 

http: //lib.stat.emu.edu/xlispstat/bpois 

Of course, the user must already have access to the core LISP-STA T system; 
see the instructions above. 


MCSim 

Originally developed to do uncertainty analysis (simple Monte Carlo simu¬ 
lations) on models specified by ordinary differential equations (e.g., phar¬ 
macokinetic models), MCSim is a simulation package that allows the user to 
design statistical or simulation models, run Monte Carlo stochastic simu¬ 
lations, and perform Bayesian inference through MCMC simulations. The 
program is written in C, with both source code and a user's manual avail¬ 
able. The program and further information is available from the webpage 
maintained by the program's primary author, Frederic Bois, at 

http: //sparky.berkeley.edu/users/fbois/mesim.html 


BINGO 

An acronym for Bayesian Inference - Numerical, Graphical and Other stuff, 
this program is an extended version of Bayes Four, developed by Prof. 
Ewart Shaw at Warwick University. Built for portable APL workspaces, an 
initial version is available at 

http: //www.Warwick.ac.uk/statsdept/Staff/JEHS/index.htm 

bingo features a "multi-kernel" version of the Naylor and Smith (1982) 
algorithm, highly efficient ("designed") numerical integration rules, device¬ 
independent graphics, and facilities for other approaches. The user must 
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supply APL code for the likelihood and so on, similar to Bayes Four 
Other facilities include built-in reparametrizations, Monte Carlo impor¬ 
tance sampling and the Metropolis-Hastings algorithm, quasirandom se¬ 
quences, Laplace approximations, function maximization, APLTpX, and 
the compact device-independent graphics language WGL (Warwick Graph¬ 
ics Language, pronounced "wiggle"). 
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APPENDIX D 


Answers to selected exercises 


Chapter 1 

1. Though we may have no prior understanding of male hypertension, the 
structure of the problem dictates that 0 £ [0,1], providing some infor¬ 
mation. In addition, values close to each end of this interval seem less 
likely, since we know that while not all males are hypertensive, at least 
some of them are. 

Suppose we start with 0 = . 3 . Given the knowledge that 4 of the first 5 
men we see are in fact hypertensive, we would likely either (a) keep our 
estimate unchanged, since too few data have yet accumulated to sway 
us, or (b) revise our initial estimate upward somewhat - say, to .35. (It 
certainly would not seem plausible to revise our estimate downward at 
this point.) 

Once we have observed 400 hypertensive men out of a sample of 1000, 
however, the data's pull on our thinking is hard to resist. Indeed, we 
might completely discard our poorly-informed initial estimate in favor 
of the data-based estimate, 6 = .4. 

All of the above characteristics of sensible data analysis (plausible start¬ 
ing estimates, gradual revision, and eventual convergence to the data- 
based value) are features we would want our inferential system to pos¬ 
sess. These features are possessed by Bayes and empirical Bayes systems. 

5. (a) p-value = P(X > 13 or X < 4) = .049 under Hq. So, using the usual 
Type I error rate a = .05, we would stop and reject H 0 - 
(b) For the two-stage design, we now have 

p — value = P(Xi > 13 or X\ < 4) 

+ P(X i + X 2 > 29 and 4 < Xi < 13) 

+P(X 1 + X 2 < 15 and 4 < X x < 13) 

= .085 , 

which is no longer significant at a = .05! Thus even though the ob¬ 
served data was exactly the same, the mere contemplation of a second 
stage to the experiment (having no effect on the data) changes the 
answer in the classical hypothesis testing framework. 
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(c) The impact of adding imaginary stages to the design could be that 
the p-value approaches 1 even as xq remains fixed at 13! 

(d) Stricly speaking, since this aspect of the design (early stopping due 
to ill effects on the patients) wasn't anticipated, the p-value is not 
computable! 

(e) To us it suggests that p-values are less than objective evidence, since 
not only the form of the likelihood, but also extra information (like 
the design of the experiment), is critical to their computation and 
interpretation. 

6.(a) B = . 5, so e\y ~ JV(.5p + ,5y,.5(2)) =N(2,1). 

(b) Now B = .1, so 9\y ~ N(. 1/i + ,9y,.9(2)) =N(3.6,1.8). The in¬ 
creased uncertainty in the prior produces a posterior that is much 
more strongly dominated by the likelihood; the posterior mean shrinks 
only 10% of the way toward the prior mean p. The overall precision 
in the posterior is also much lower, reflecting our higher overall un¬ 
certainty regarding 0. Despite this decreased precision, a frequentist 
statistician might well prefer this prior, since it allows the data to 
drive the analysis. It is also likely to be more robust, i.e., to perform 
well over a broader class of ^-estimation settings. 


Chapter 2 


3. From Appendix Section A.2, since x 2 { I/ ) = G{vJ 2,2), we have that 
G(a, 2) = x 2 (2a). But if X ~ G(a, 2), a simple Jacobian transformation 
shows that Y = /3X/2 ~ G(a, fl). Finally, Z= l/Y ~ IG(a, (3) by 
definition, so we have that the necessary samples can be generated as 


<r 2 = where X ~ x 2 (2a) 


and 


«= 2+ a > 




4. Since 0 = 3i+ in this example, a difficulty with the histogram approach 
is its finite support (though in this example, terminating the histogram 
at, say, ten feet would certainly seem safe). Matching a parametric form 
avoids this problem and also streamlines the specification process, but an 
appropriate shape for a population maximum may not be immediately 
obvious, nor may the usual families be broad enough to truly capture 
your prior beliefs. 

5. (a) 6 ~ Beta(a, ft) 

(b) 6 ~ Beta(a,(3) (likelihood same as in (a)) 

(c) 9 ~ D(a) 
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(d) 0 ~ IG(a , b) 

(e) none available 

(f) e~N k (ti,v) 

(g) E~W(I2,v) 

6 . (a) scale; p(a) oc 1 /<r 

(b) location-scale; p(9, a) oc 1/a 

(c) location; p(p) oc 1 

(d) scale; p({3) oc 1//? 

(e) neither (a is often called a s/iape parameter) 

7. The Jeffreys prior for y is 


d 2 

J(t) = -£x| 7 ^T lo gP( x l7) • 


But, using the chain rule, the product rule, and the chain rule again, we 
can show that 

Since the second terms in these two products are constant with respect 
to x and the score statistic has expectation 0 , we have 

r/x _ jr r«* 2 logp(x|0)l (d6\ 2 


- *■■(*)■ 

Hence [/( 7)] 1 / 2 — [I{9)] l l 2 • \dO/dj\, as required. 
10.(a) For the joint posterior of 0 and <r 2 , we have 


P{d,cr 2 \y) oc (a 2 ) n/2 1 exp{-^ - 0) 2 } . 


This expression is proportional to an inverse gamma density, so inte¬ 
grating it with respect to a 2 produces 


P(%) oc [^(Vi - 0)‘ 


But using the identity Yl?=i(Vi ~ @) 2 = C + n(y — 6) 2 , where C = 

- y ) 2 - we have that 

p(%) oc [C + n(y - 0) 2 ]-"/ 2 oc [1 + t 2 /(n - 1 ))~ n l 2 . 
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(b) Here, the above identity must be employed right away, in order to 
integrate 9 out of the joint posterior (D.l) as proportional to a normal 
form. The result for p(<r 2 |y) is then virtually immediate. 


Chapter 3 


l.(a) For f(y\0) = N(y\6, 1), we have that 9 = y+f'(x\9)/f(x\9). The result 
follows from substituting this expression for 9i in the numerator of 
the basic posterior mean formula. 


E(0i Ik) 


f9jf(x i \9 l )dG(9 l ) 
f f(xi\9i)dG{0i) 


and simplifying. 

( Note: The solution to a more general version of this problem is given 
by Maritz and Lwin, 1989, p.73.) 

2. The computational significance of this result is that we may find the joint 
marginal density of y (formerly a /.'-dimensional integral with respect 
to 8 as the product over i of the result of a one-dimensional 

integral. Since m(y\r]) must emerge in closed form for computation of 
a marginal MLE f) to be feasible, this reduction in dimensionality plays 
an important role m PEB analysis. 


3.(a) 

(b) 


™(yi\v) = r(a + v)Vi 1 /[ r ( a )r(p)(2/t + i) a+r '] 

With a = 2, 

T / | \ rr r)(r] + l)j/i 

Taking the derivative of the log-likelihood with respect to 77 and set¬ 
ting the result equal to 0, we have that the marginal MLE p is a 
positive root of the quadratic equation 


Sr) 2 + {S — 2 k)r] - k — 0 , 
where S = ^Li l°g(S/i + !)• 

4(a) The marginal loglikelihood for B is proportional to 


Jb, „ 2 

jlojB- ■ 

i — 1 

Taking the derivative with respect to B, setting equal to 0, solving, 
and remembering the restriction that 0 < B < 1, we have that 


Bmle — min 


Eh y, 


The resulting PEB point estimates are 9f B = (1 — Bmle )Yi. 
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(b) Since ELi= 5||Y|| 2 ~ X 2 (fc) = G(k/ 2,2), it follows that 
11Yj] 2 ~ G(k/2,2/B), and hence that 1/(||Y|| 2 ) ~ IG(k/ 2, 2/B). 
Thus we can compute 


E(B) = E 


k-2 

Imp 


k-2 _B(k- 2) 


2 . Ik 
B 


( 1 - 1 ) 


k-2 


= B 


and so B is indeed unbiased for B. Hut then 


qEB 


= (1 - B)Yi = 1 - 


k-2 


Yi = ft 


JS 


6. We need to compute 


E 


Y| 6 


1 


Y IP 


Consider the case where cr 2 = 1 and replace Y by Z; the general case 
follows by writing Y = <rZ. Now || Z || 2 is distributed as a non-central 
chi-square on k df with non-centrality parameter -y =|| Q || 2 j‘l. But a 
non-central chi-square can be represented as a discrete mixture of central 
chi-squares, i.e., \ 2 (k + 2 J) where J ~ Po(j). Therefore, 


E> 


zi e 


i 


= E 


Z|7 


1 


= E 


i j -j 

-rre 7 - 

J- 


i= 0 - k + V- 2 ’ 

since l/x 2 (k + 2J) = IG{(k -f- 2j)/2, 2) (see Appendix Section A.2). 
9. The population moments in the marginal family rn(r,: \a, f3) are 

E{ri) = ±-E(Yi) = = E(9i) = cep 


and 


Var(n) 


_ l 


^Var(^) 


t 

= jj[Var(E(Yi\0i)) + E(Var(Yi\0i))] 


= ^[tf(aP 2 ) + U(ap)\ = a/? 2 + f,. . 

This results in the system of 2 equations and 2 unknowns 

r = a/? 

s 2 =a/? 2 +a/?ELi*rV fc 

Solving, we obtain a = r 2 /(s 2 - r Ei=i and /? = r/a. 
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Chapter 4 


2.(a) The MLE is equal to the Bayes rule when M = 0, and carries a risk 
( MSE) of£J X | 9 (i — 9) 2 = Var x = #(1 — 0)/n. Subtracting the 
risk of the Bayes estimate given in equation (4.4) and setting p = 
we have risk improvement by the Bayes estimate if and only if 


g(l-g) 

n 



M + n 


2 


M 

M + n 


(#") >0- (D.2) 


The two 9 values for which this expression equals 0 (available from 
the quadratic formula) are the lower and upper bounds of the risk 
improvement interval. Clearly this interval will be symmetric about 
0=4, since there the first term in (D.2) is maximized while the 
second is minimized (equal to 0). 

(b) Since 9\x ~ Beta(a + x, b + n — x), we have 

0 aa+x—1 
0 ( 1 - 0 ) _ 

_L_ Qa+x- 1(1 _ e y+n-x-l d0 

J9 a+X ~ 1 (l - 9) b+n ~ x ~ 2 d9 

f9a+x-2(l _ gy+n-x-2 d 9 
r(a+x)Y(b-\-n — x — l) 

_ r(a+6+n —1) 

l)F(&-f-n— x— 1) 

a + x — 1 
a + b + n — 1 

from the result of Appendix B, Problem 1. 

6.(b) This is actually the prior used by Miiller and Parmigiani (1995(.Using 
the "shortcut" method (sampling both 9 * and y* with N = 1), they 
obtain h =29. 

Even this mixture prior is amenable to a fully analytical solution, 
since the posterior distribution is still available as a mixture of beta 
densities. The authors then obtain the refined estimate n = 34. 




ayes 


f 


f 


(1 -9) b+n - x - 1 d9 


Chapter 5 

3. The probability p of accepting a rejection candidate is 


p = P(L(9 )tt( 9) > MUg{9)) 

mm 


-J 
- J 


P\U < 


L(OMO) 

Mg{9) 


Mg(9) 

g(9)d$ 


g(9)d9 
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4. The new posterior is 


= ^ / m*(o)de 

c 

M ' 


Pa(%) « ^yPi(%)- 

In the notation of the weighted bootstrap, we have g(9) — pi(#|y), and 
we seek samples from 


L{0H6) = ^ Pl( ^ |y) 

Thus if we compute 


w(6) = 


■* 2 ( 6 ) 

7Ti(6») 


_ K2{0l)/Kl(0i) 

qt Ef=i M8j)/xi(dj) 

then resampling 9 *s with probability qi provides a sample from P2(0\y)- 
This procedure is likely to work well only if 7 r 2 has lighter tails than jti . 

15. The Hastings acceptance ratio r is 

_ p(v)q(u, v) 

p{u)q(v, u) 

p(v)p(ui\uj&) 

p{u)p(Vi\u^i) 

p(Vj | ^i)p{Uj )p(uj \Uj ^i ) 

p(«» I U i ^i)p(Uj^i)p(Vi\Uj^i) 

= 1 , 


since v and 11 differ only in their i th components. 

17.(a) Since logpz(z) oc (5 — l)log 2 — f3z + fclog(l — e~ z ), we have 

d\ogp z (z) 8-1 k 

oc- fj + - 


and 


dz 

d 2 log pz(z) —(5—1) ke z 
dz 2 z~ (e z — l) 2 


1 


< 0 for 6 > 1 . 


Thus pz(z) is guaranteed to be log-concave provided 8 > 1. 
(b) Marginalizing the joint density, we have 




z t-i e -P* 


k 

JJ I(e-‘ ,l)( u i) d Ul ■■■du k 
i=1 
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-^(e -s ,l)(^* 'jd'Ui 

k 

= z s - l e~ fi * JJ(l-e-') 

«=1 

= z i- 1 e - ^*(l — e~ z ) k oc p z (z) 

(c) The full conditional for U is 



k k 

Pu\z{u\z) cxp Ut z(u,z) oc = JJ Unif(e ~ z , i) , 

i=l i=l 

i.e., conditionally independent uniform distributions. For Z, we have 

Pz\u{A u ) <* Z s - 1 e~ l3z l( e -> t i)(ui) • ••/(<.-.,!)(«*) • 

So e -2 < Ui for all i implies z > -logu, for all i, which in turn 
implies z > - log u min where u min = min( m ; .e/*). Hence, 

Pz|tt(^l«) a z l5_1 e _ ^ 2 /(_ io gUm ,„ , oo)(z) , 

a truncated gamma distribution. Alternately sampling from pu | z and 
Pz|£/ produces a Gibbs sampler which converges to pu y z\ peeling off 
the Z { - 9 - draws after convergence yields the desired sample from the 
D-distribution Pz ■ Special subroutines required to implement this al¬ 
gorithm would include a Unif(0, 1) generator, so that the Ui draws 
could be computed as e~ z -f (1 — e~ z ) ■ Unif(0, 1), and a gamma cdf 
and inverse cdf routine, so that the z draws could be found using the 
truncated distribution formula given in Chapter 7, problem 4. 


19. We have that 
Axl; 


Hence, 


\x-A-x\y 


Pi ..... 

P3 

P1+P3 

P 2 

P1+P3 
P 4 

P2+P4 

P 2 +P 4 

Pi 

P2 

Pl+P2 

P 3 

Pl+P2 

Pi 

P3+P4 

P3+P4 


fxA x \ x = (pi+p 2 P3+P4 ) 

= ( Pi + P3 P2+Pa) 

= /, ■ 


Pi 

P2 \ 

P1+P2 

_P_3 _ 

P1+P2 | 

V 4 1 

P3+P4 

P3+P4 / 

/ Pi 

P2 

P1+P2 

i P3 

Pi +P2 
Pi 

' P3+P4 

P3+P4 
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20.(b) Analytically, we have 


/•OO fOO 

fx(x) = / / ye~ yx te~ ty dy 

Jo Uo 


fx(t)dt 


- f 


0 U^ + O 2 


fx(t)dt 


f x ( t) = 1 It is a solution to this equation. But no Gibbs convergence is 
possible here, since this is not a density function; f x (t)dt — +oo. 
The trouble is that the complete conditionals must determine a proper 
joint density. 

22.(a) p(0i\8 2 ,y) and p{9 2 \9 2 , y) are 


AT ( b \{y - 0 2 ) + a x b\ \ 

\ i+6? ■ i+6?; 

and 

N ( b l(y - 0i) + ° 2 6? \ 

V 1 + 6? ’ 1 + blJ ’ 

respectively. 

(b) p(0i\y) and p(0 2 \y) are 

/ 6?(t/-a 2 ) + (l+6?)ai 6?(1 + 6?) \ 

\ 1 + 6? + b\ l + 6? + 6?y 


and 

/ b\{y - a,]) + (1 + 6?)q 2 6?(1 + 6?) \ 

\ 1 + 6 ? + 6 2 1 + 6 ? + 6 | / 
respectively, so the data do indeed inform (i.e., there is "posterior 
learning"). 

(c) Answer provided in Figure D. 1. Convergence is obtained immediately 
for n but never for 0\ or 0- 2 . Note that the estimates of E(y\y) at 100 
and 1000 iterations are numerically identical, 
fd) Answer provided in Figure D.2. Now we see very slow convergence 
for p, due to the very slow convergence for the (barely identified) 6>x 
and 6 2 . The estimates of E(/i\y) at 100 and 1000 iterations are now 
appreciably different (with only the latter being correct), but there 
is no indication in the 100 iteration p plot, G&R statistic, or lag 1 
sample autocorrelation that stopping this early is unsafe! 

(e) This example provides evidence against the recommendation of sev¬ 
eral previous authors that when only a subset of a model's param¬ 
eters are of interest, convergence of the remaining parameters need 
not be monitored. While completely unidentified parameters in an 
overparametrized model may be safely ignored, those that are identi¬ 
fied only by the prior must still be monitored, and slow convergence 
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thereof must be remedied. As a result, monitoring a representative 
subset of all the parameters present is recommended, along with the 
use of simple reparametrizations where appropriate to speed overall 
convergence. 


Chapter 6 


1. Standard hierarchical Bayes calculations yield the following complete 
conditional distributions: 


0|/3,A,<r 2 ,y oc 


0\O,\,o- 2 ,y ~ 

N(Bb, B), 

where B~ l = 


and b = 

E-'y + Eo 1 ^ 

<r 2 |0,/3,A,y ~ 

IG (a 0 + — , 




i =1 


Including the A, complete conditionals given in Example 6.2, all of these 
are available in closed form with the exception of that of 9 , for which 
one can use a rejection sampling method (e.g., Metropolis-Hastings). 

3. The posterior probability of # 0 , P G (0 g [6 L ,9u]\x), is given by 



[9 u -9 l I [«l,Bu](°) + ( 1 7r )</( S ) 

dO 

//(*!“) i 

> u -e L I lH,eu]{ u ) + (! 

du 


Setting this expression < p and keeping in mind that g{9) has no support 
on the interval [9 L , 6u], we obtain 7i c as the set of all G such that 




4. While the faculty member's solution is mathematically correct, it is not 
practical since we lack a closed form for the ratio in the expression (i.e. 
p(0|y W ) and p(0\y) have different and unknown normalizing constants). 
A conceptually simple alternative would be to run the MCMC algorithm 
again after deleting y i: obtaining 9 l9) samples from p(0|y(;)) and subse- 
queenly obtaining the conditional predictive mean as 

E(Vi\y(i)) = j E{y l \e)p(9\y (i) )dd 
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^f>(2A \ 0 (g) )- 


A more sophisticated approach (suggested by a more sophisticated fac¬ 
ulty member. Prof. Hal Stern) would be to write 


/ 


E(yi\9)p(6\y {i ))d8 = 


j E( yi \e)f(y (i) \e)ir(e)de 


J7(y(i)|0M0)d0 

jew 


EL E{y, 


j(.y\0 {gT ) 


') 


y-G j(y M \9 (!l) ) 
^ g=1 f(y\9 w ) 


where the samples now come directly from the full posterior p(0|y), 
which plays the role of the importance sampling density g in (5.8). In 
the second line above, the prior terms tt(8) cancel in both the numerator 
and denominator, while the marginal density terms m(y ) can be pulled 
through both integrals and cancelled, since they are free of 9. Note this 
solution will be accurate provided p(0|y) ~ p(0|y(i)), which should hold 
unless the dataset is quite small or y ■ is an extreme outlier. 

5. We have 


f(vi\y(i)) 


= J f(yi\o,y(i))p(0\y(i))d9 
= J f(yiW)p{Q\y(i))d9 

« j f(yi\9)p(9\y)de 


where are the MCMC samples from p(6\y). Since the CPO is the 
conditional likelihood, small values indicate poor model fit. Thus we 
might classify those datapoints having CPO values in the bottom 5- 
10% as potential outliers meriting further investigation. 

6. We have 

Pd = J P[D(y*,9)>D(y,9)\e}p(9\z)d6 

= j J / { £»(y.,e ) >£ ) (y,e)}(y*^)/(yl e )p(®|z) d y d0 
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method 

P(M = 2|y) 

SD 

BF 21 

P( i) 

time 

CC 

.70806 

.001721 

4848.4 

.567 

22 . 8 ” 

RJ-M 

.70861 

.004058 

4861.3 

.589 

18.7” 

RJ-G 

.70906 

.002394 

4871.9 

.593 

7.9” 

RJ-R 

.70750 

.002004 

4835.1 

.660 

6.7” 

Chib-1 



4860.7 


13.6” 

Chib-2 



4860.3 


14.0” 

target 

.70865 


4862 




Table D.l Comparison of different methods, for the simple linear regression ex¬ 
ample. Here, CC — Carlin and Chib's method; RJ-M = reversible jump using 
Metropolis steps if the current model is proposed; RJ-G = reversible jump using 
Gibbs steps if the current model is proposed; RJ-R = reversible jump on the re¬ 
duced model (i.e., with the regression coefficients integrated out): Chib-1 = Chib's 
method evaluated at posterior means; and Chib-2 = Chib's method evaluated at 
frequentist LS solutions. 

1 G 

~ ~gLj j! {D(y*( 9 ) ! 6» (3) )> J D(y,0 (5) )} ly ° ’ ■ 

a -1 

where (y*^, 0 ^) ~ p(y*,9\z) = f(y* \9)p(8\z). and Is denotes the 
indicator function of the set S. That is, 

, number of D(y* {g \ 8 ( ' 9) )s > D(y.8^ 9) ) 

Pd- g ' 

The sampler may not have been designed to produce predictive samples 
y* 1 g> but such samples are readily available at iteration g by drawing 
from /(y* \ 9 1 ' 91 ), a density that is available in closed form by construc¬ 
tion. 

8 . For each method, we run five independent chains for 50,000 iterations 
following a 10,000 iteration burn-in period. Table D.l summarizes our 
results, reporting the estimated posterior probability for model 2 , a 
batched standard deviation estimate for this probability (using 2500 
batches of 100 consecutive iterations), the Bayes factor in favor of model 
2, BF 2 i, the lag 1 sample autocorrelation for the model indicator, p( 1 ), 
and the execution time for the FORTRAN program in seconds. Model 2 is 
clearly preferred, as indicated by the huge Bayes factor. The "target" 
probability and Bayes factor are as computed using traditional (non- 
Monte Carlo) numerical integration by Green and O’Hagan (1998). Pro¬ 
gramming notes for the various methods are as follows: 
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(a) 


(b) 


(c) 


For Chib's (1995) marginal density method, we may use a two-block 
Gibbs sampler that treats a 2 or r 2 as 0 lf and (a,f3)' or ( 7 , <5)' as Q 2 . 
We experimented with two different evaluation points ( 0[. 0 '.,): see 
the table caption. 

For Carlin and Chib's (1995) product space search, we have 0\ = 
(a, /?, rr) and 0 2 — ( 7 , 6, r). For the regression parameters, one may 
use independent univariate normal pseudopriors that roughly equal 
the corresponding first-order approximation to the posterior. Accept¬ 
able choices are a\(M = 2) ~ iV(3000, 52 2 ), 0\(M = 2) ~ N(185, 12 2 ), 
y\(M = 1) ~ N( 3000,43 2 ), and 6\(M=l) ~ (V(185,9 2 ). For ^ and 
r 2 , taking the pseudopriors equal to the 1G priors produces accept¬ 
able results. A bit of tinkering reveals that = .9995, 7 t 2 = .0005 
produces M^ 9> iterates of 1 and 2 in roughly equal proportion. 

For Green's (1995) reversible jump algorithm, we use log transforms 
of the error variances, i.e., A = logcr 2 , u> = logr 2 , to simplify the 
choice of proposal density. The probabilities of proposing new mod¬ 
els are h(l, l) = h( 1, 2) = h(2,l) = h(2, 2) = 0.5. The dimension 
matching requirement is automatically satisfied without generating 
an additional random vector; moreover, the the Jacobian term in 
the acceptance probability is equal to 1. Since the two regression 
models are similar in interpretation, when a move between mod¬ 
els is proposed, we simply set (a,/3, A)' = (7 or ( 7 , 6,u>)‘ = 
(a, (3, A) . The acceptance probabilities are then given by — 


mm 


i 1 ’ 


/(yWi'VAt= 


1 

ijTTi J ’ 


and c* 2 —>i = min 


V’ /fy|7,'5^,M=2)7r 2 /• 


f(y\ a ,0,x,M-- 

When the proposed model is the same as the current model, we update 
using a standard current-point Metropolis step. That is, for model 


1 we draw (a*,/3*, A*)' ~ N ((»<*), fl k) , \ {k )', Diag(mO, 250, l)j 

and set a( k+1 \ ft k+1 \\( k+1) )' = (a*,/?*, A*/ with probability r = 
p(a *, /?*, \*)/p(a( k \ f3( k \ A ( *^). Similar results hold for model 2. Al¬ 
ternatively, we may perform the "within model" updates using a stan¬ 
dard Gibbs step, thus avoiding the log transform and the trivariate 
normal proposal density, and simply using instead the corresponding 
full conditional distributions. 

Finally, we may also use the reversible jump method on a somewhat 
reduced model, i.e., we analytically integrate the slopes and inter¬ 
cepts (a, f3 , 7 , and 6 ) out of the model and use proposal densities 
for <t 2 and r 2 that are identical to the corresponding priors. The 
acceptance probabilities are ai_, 2 =min|l, and 


a 2 _> 1 = min < 1, 


iri f(y <? 2 ,m= i) l 
*2 /(y t2,m=2) j 


Summarizing Table D.l, the reversible jump algorithm operating on the 
reduced model (with the regression coefficients integrated out) appears 
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to be slightly more accurate and faster than the corresponding algo¬ 
rithms operating on the full model. Of course, some extra effort is re¬ 
quired to do the integration before programming, and posterior samples 
are obviously not produced for any parameters no longer appearing in 
the sampling order. The marginal density (Chib) method does not re¬ 
quire preliminary runs (only a point of high posterior density. O'), and 
only a rearrangement of existing computer code. The accuracy of results 
produced by this method is more difficult to assess, but is apparently 
higher for the given runsize, since repeated runs produced estimates for 
BF 21 consistently closer to the target than any of the other methods. 

Chapter 7 

5. The Bayesian model is 

k k 

P( y,0|A) a Y[0?‘ exv(-ei6i)Y[0?~ l p- a exp(~6i/0)Is,(9i) ■ 

i = 1 t = 1 

Hence, using the results of Gelfand, Smith, and Lee (1992), the full 
conditionals for the 8i are given by 

| «*, (D.3) 

for i = 1. k, where a* = a + d<, /?* = (/l -1 + ej) -1 , do = 0, and 

Q k+ 1 = B. The complete conditional for 0 is readily available as the 
untruncated form 

p(j%, 6) = IG^a + ka, jfc" 1 + £ 

The result in the previous problem could be used in generating from the 
truncated gamma distributions in (D.3), provided subroutines for the 
gamma cdf (incomplete gamma function) and inverse cdf were available. 

6. (b) The simplest solution here is to restrict 0 to he in an increasing convex 

constraint set, 

{0 : 0i > 0 , 6 k < B , 0 < 0 2 - 0i < • • • < 9 k - Ok- 1 } ■ 

However, Gelman (1996) points out that this is a very strong set of 
constraints, and one which becomes stronger as k increases (e.g., as 
we graduate more years of data, or even subdivide our yearly data into 
monthly intervals). It is akin to fitting a quadratic to the observed 
data. 

9.(a) This is the approach taken by Carlin and Poison (1992). 

(b) This is the approach taken by Albert and Chib (1993a) and (in a 
multivariate response setting) by Cowles et al. (1996). 
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10. (a) Running 3 chains of our Gibbs-Metropolis sampler for 1000 iterations 

produced the following posterior medians and 95% equal-tail credi¬ 
ble intervals: for <p, 1.58 and (0.28, 4.12); for X(syi), .085 and (.028, 
.239); and for X(sb), .077 and (.021, .247). The spatially smoothed 
predictions at points A and B are consistent with the high values at 
monitoring stations 4 and 5 to the east, the moderate value at station 
3 to the west, and the low value at station 8 to the far south. 

(b) Ozone is a wind-borne pollutant, so if a prevailing wind (say, from 
west to east) dominated Atlanta's summer weather pattern, it would 
be reasonable to expect some anisotropy (in this hypothetical case, 
more spatial similarity west to east than north to south). 

A credible interval for /? 12 that excluded 0 would provide evidence 
that the anisotropic model is warranted in this case. One might also 
compare the two models formally (assuming proper priors) via Bayes 
factors (see Subsections 6.3 and 6.4) or informally via predictive or 
penalized likelihood criteria (see Subsection 6.5). 

11. Based on 1000 Gibbs samples from single chain generated in the BUGS 
language (and collected after a 500-iteration burn-in period), Spiegel- 
halter et al. (1995b) report the following posterior mean and standard 
error estimates: 

(a) a, .73 ± .14; Kg, —.54 ± .16. Thus, the observed rates are slightly 
lower than expected from the standard table, but higher for those 
counties having a greater percentage of their population engaged in 
agriculture, fishing, and forestry (AFF). 

(b) a, .39 ± .12 (an overall intercept term cannot be fit in the presence 
of our translation-invariant CAR prior). While the covariate remains 
significant in this model, its reduced importance would appear due 
to this model's ability to account for spatial similarity. (We would 
expect some confounding of these two effects, since counties where 
AFF employment is high would often be adjacent.) 

(c) Spiegelhalter et al. (1998) obtain pn — 39.7, DIC = 101.9 for the 
part (a) model (covariate plus exchangeble random effects), and pu = 
29.4, DIC — 89.0 for the part (b) model (covariate plus spatial ran¬ 
dom effects). Thus, DIC prefers the spatial model, largely due to its 
slightly smaller effective sample size. 

(d) The fact that the spatial plus covariate model fits slightly better than 
the heterogeneity plus covariate model suggests that there is some 
excess spatial variability in the data that is not explained by the 
covariate. To get a better idea of this variability, we could map the 
fitted values <pi = E(<pi\y), thinking of them as spatial residuals. A 
pattern in this map might then indicate the presence of a lingering 
spatially-varying covariate not yet accounted for. 
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(e) Suppose we begin with proper hyperpriors for r and A. The transla¬ 
tion invariance of the CAR prior forces to set Kg = 0. Alternatively, 
we could estimate an overall level Kg after imposing the constraint 
Hi Si = 0 and/or = 0 . 


Appendix B 

1. This proof is similar to that given in Example B.5. 

2. For action a , the posterior risk is 

p{7r,a) = l(0i,a)ir(0i |x) 

i 

= 1 — 7r(a|x) , 

since 1(9}, a) = 1 in the sum unless 6i = a , and 7r(0, |x) = 1. To 
minimize this quantity, we take a to maximize r(a|x), hence the Bayes 
action is the posterior mode. 

9. The optimal value of c is 

+ (^) 2 ' 

Because c < 1, the optimal estimator shrinks the unbiased estimator 
toward 0 by an amount that depends on the relative size of a and 
(pf3). As o- —► oo, c —>■ 0 (variance reduction is all important). When 
<7 —> 0, c —> 1 (when the variance is small, bias reduction is the 
dominant goal). 
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Figure D.l MCMC analysis ofN( 9 2 + 8 2 , 
ha\’e prior mean 50 and prior standard devi 


© 2000 by CRC Press LLC 


G&R 4.25,7.88 ; acfl 0.99 


mean and sd: 39.36 23.88 



iteration 

G&R 4.23,7.87 ; acfl 0.989 



mean and sd: -39.36 23.9 




-100 -80 -60 -40 -20 0 20 



likelihood, Case 1: both parameters 
ion 1000. 







































G&R 1.62,2.45 ; acfl 0.972 


mean and sd: 22.814.16 


G&R 1.14 , 1.33 ; acfl 0.975 


mean and sd: 2.22 10.3 


) 20 40 60 80 100 



0 20 40 60 


0 200 400 600 800 1000 


-20 0 20 40 60 


G&R 1.6,2.42 ; acfl 0.971 


A 1 J J 


20 40 60 80 100 


mean and sd: -22.08 14.06 


G&R 1.13,1.32; acfl 0.974 


mean and sd: -1.71 10.23 



-60 -40 -20 


0 200 400 600 800 1000 


1 . 


-40 -20 0 20 


G&R 1 ,1.01 ; acfl -0.003 


mean and sd: 0.72 1 


G&R 1,1; acfl -0.04 


mean and sd: 0.51 1.01 




Figure D.2 MCMC analysis of N(Q 1 -j -62,!) likelihood, Case 2: both parameters 
have prior mean 50 and prior standard deviation 10. 


© 2000 by CRC Press LLC 





























References 


Abrams, D.I., Goldman, A.I., Launer, C., Korvick, J.A., Neaton, J.D., Crane, 
L.R., Grodesky, M.. Wakefield, S., Muth, K., Kornegay, S., Cohn, D.L., Har¬ 
ris, A.. Luskin-Hawk. R., Markowitz, N.. Sampson. J.H., Thompson, M., Dey- 
ton. L., and the Terry Beirn Community Programs for Clinical Research on 
AIDS (1994). Comparative trial of didanosine and zalcitabine in patients with 
human immunodeficiency virus infection who are intolerant of or have failed 
zidovudine therapy. New England Journal of Medicine , 330. 657-662. 

Adler, S.L (1981). Over-relaxation method for the Monte Carlo evaluation of the 
partition function for multiquadratic actions. Physical Review D. 23, 2901- 
2904. 

Agresti, A. (1990). Categorical Data Analysis. New York: John Wiley. 

Albert, J.H. (1993). Teaching Bayesian statistics using sampling methods and 
MINITAB. The American Statistician, 47, 182-191. 

Albert, J.H. (1996). Bayesian Computation Using Minitab. Belmont, CA: 
Wadsworth. 

Albert, J.H. and Chib, S. (1993a). Bayesian analysis of binary and polychotomous 
response data. J. Amer. Statist. Assoc., 88, 669-679. 

Albert, J.H. and Chib, S. (1993b). Bayes inference via Gibbs sampling of autore¬ 
gressive time series subject to Markov mean and variance shifts. Journal of 
Business and Economic Statistics, 11, 1-15. 

Almond, R.G. (1995). Graphical Belief Modeling. London: Chapman and Hall. 

Ames, B.N.. Lee, F.D., and Durston. W.E. (1973). An improved bacterial test 
system for the detection and classification of mutagens and carcinogens. Pro¬ 
ceedings of the National Academy of Sciences ,70,782-786. 

Andrews, D.F. and Mallows, C.L. (1974). Scale mixtures of normality. J. Roy. 
Statist. Soc., Ser. B, 36,99-102. 

Antoniak, C.E. (1974). Mixtures of Dirichlet processes with applications to non- 
parametric problems. Ann. Statist., 2, 1152-1174. 

Bates, D.M. and Watts. D.G. (1988). Nonlinear Regression Analysis and its Ap¬ 
plications. New York: Wiley. 

Bayes. T. (1763). An essay towards solving a problem in the doctrine of chances. 
Philos. Trans. Roy. Soc. London, 53, 370-418. Reprinted, with an introduction 
by George Barnard, in 1958 in Biometrika, 45, 293-315. 

Becker. R.A., Chambers, J.M., and Wilks, A.R. (1988). The New S Language: 
A Programming Environment for Data Analysis and Graphics . Pacific Grove, 
CA: Wadsworth and Brooks/Cole. 


© 2000 by CRC Press LLC 




Berger, J.O. (1984). The robust Bayesian viewpoint (with discussion). In Robust¬ 
ness in Bayesian Statistics, ed. J. Kadane, Amsterdam: North Holland. 

Berger, J.O. (1985). Statistical Decision Theoty and Bayesian Analysis, 2nd ed. 
New York: Springer-Verlag. 

Berger, J.O. (1994). An overview of robust Bayesian analysis (with discussion). 
Test, 3, 5-124. 

Berger, J.O. (2000). Bayesian analysis: A look at today and thoughts of tomorrow. 
To appear, J. Ainer. Statist. Assoc. 

Berger, J.O. and Berliner, L.M. (1986). Robust Bayes and empirical Bayes anal¬ 
ysis with e-contaminated priors. Annals of Statistics, 14, 461-486. 

Berger, J.O. and Berry, D.A. (1988). Statistical analysis and the illusion of ob¬ 
jectivity. American Scientist, 76. 159-165. 

Berger, J.O. and Delampady, M. (1987). Testing precise hypotheses (with discus¬ 
sion). Statistical Science, 2, 317-52. 

Berger, J.O. and Pericchi, L.R. (1996). The intrinsic Bayes factor for linear mod¬ 
els. In Bayesian Statistics 5, J.M. Bernardo, J.O. Berger, A.P. Dawid, and 
A.F.M. Smith, eds., Oxford: Oxford University Press, pp. 25-44. 

Berger, J.O. and Sellke, T. (1987). Testing a point null hypothesis: The irrec¬ 
oncilability of p values and evidence (with discussion). J. Am. Statist. Assoc., 
82, 112-122. 

Berger, J.O. and Wolpert, R. (1984). The Likelihood Principle. Hayward, CA: 

Institute of Mathematical Statistics Monograph Series. 

Berger. R.L. and Hsu, J.C. (1996). Bioequivalence trials, intersection-union tests 
and equivalence confidence sets (with discussion). Statistical Science, 11, 283- 
319. 

Bernardinelli, L., Clayton, D.G., and Montomoli, C. (1995). Bayesian estimates 
of disease maps: How important are priors? Statistics in Medicine, 14, 2411- 
2431. 

Bernardinelli, L., Clayton, D.G., Pascutto, C.. Montomoli, C., Ghislandi, M., and 
Songini, M. (1995). Bayesian analysis of space-time variation in disease risk. 
Statistics in Medicine, 14, 2433-2443. 

Bernardinelli, L. and Montomoli, C. (1992). Empirical Bayes versus fully Bayesian 
analysis of geographical variation in disease risk. Statistics in Medicine, 11, 
903-1007. 

Bernardo, J.M. (1979). Reference posterior distributions for Bayesian inference 
(with discussion). J. Roy. Statist. Soc. B, 41, 113-147. 

Bernardo, J.M., Berger, J.O., Dawid, A.P., and Smith, A.F.M.. eds. (1992). 

Bayesian Statistics 4. Oxford: Oxford University Press. 

Bernardo, J.M., Berger, J.O., Dawid, A.P., and Smith, A.F.M.. eds. (1996). 

Bayesian Statistics 5. Oxford: Oxford University Press. 

Bernardo, J.M., Berger, J.O., Dawid, A.P., and Smith, A.F.M.. eds. (1999). 

Bayesian Statistics 6. Oxford: Oxford University Press. 

Bernardo, J.M., DeGroot, M.H., Lindley, D.V., and Smith, A.F.M.; eds. (1980). 

Bayesian Statistics. Valencia: University Press. 

Bernardo, J.M., DeGroot, M.H., Lindley, D.V., and Smith, A.F.M., eds. (1985). 

Bayesian Statistics 2. Amsterdam: North Holland. 

Bernardo, J.M.. DeGroot. M.H.. Lindley. D.V., and Smith, A.F.M, eds. (1988). 


© 2000 by CRC Press LLC 



Bayesian Statistics 3. Oxford: Oxford University Press. 

Bernardo, J.M. and Smith, A.F.M. (1994). Bayesian Theory. New York: Wiley. 

Berry, D.A. (1991). Bayesian methods in phase III trials. Drue Information Jour¬ 
nal , 25, 345-368. 

Berry. D.A. (1993). A case for Bayesianism in clinical trials (with discussion). 
Statistics in Medicine , 12, 1377-1404. 

Berry, D.A. (1996). Statistics: A Bayesian Perspective. Belmont, CA: Duxbury. 

Berry, D.A., Muller, P., Grieve, A.P., Smith, M., Parke, T., Blazek. R.. Mitchard, 
N., Brearly, C., and Krams, M. (2000). Bayesian designs for dose-ranging drug 
trials (with discussion). To appear in Case Studies in Bayesian Statistics, Vol¬ 
ume V , C. Gatsonis et ah. eds.. New York: Springer-Verlag. 

Berry, D.A. and Stangl, D.K., eds. (1996). Bayesian Biostatistics. New York: 
Marcel Dekker. 

Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems 
(with discussion). J. Roy. Statist. Soc., Ser. B, 36, 192-236. 

Besag, J. (1986). On the statistical analysis of dirty pictures (with discussion). 
J. Roy. Statist. Soc., Ser. B, 48, 259-302. 

Besag, J. and Green. P.J. (1993). Spatial statistics and Bayesian computation 
(with discussion). J. Roy. Statist. Soc., Ser. B. 55, 25-37. 

Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995). Bayesian computa¬ 
tion and stochastic systems (with discussion). Statistical Science , 10, 3-66. 

Besag, J., York, J.C., and Mollie, A. (1991). Bayesian image restoration, with 
two applications in spatial statistics (with discussion). Annals of the Institute 
of Statistical Mathematics, 43, 1-59. 

Best, N.G., Cowles, M.K., and Vines, K. (1995). CODA: Convergence diagnosis 
and output analysis software for Gibbs sampling output. Version 0.30. Tech¬ 
nical report. Medical Research Council Biostatistics Unit. Institute of Public 
Health, Cambridge University. 

Best, N.G., Ickstadt, K., Wolpert. R.L.. Cockings, S., Elliott. P., Bennett, J., 
Bottle, A., and Reed, S. (2000). Modeling the impact of traffic-related air 
pollution on childhood respiratory illness (with discussion). To appear in Case 
Studies in Bayesian Statistics, Volume V , C. Gatsonis et al., eds.. New York: 
Springer-V erlag. 

Birnbaum, A. (1962). On the foundations of statistical inference (with discussion). 
J. Amer. Statist. Assoc., 57, 269-326. 

Bliss, C.I. (1935). The calculation of the dosage-mortality curve. Annals of Ap¬ 
plied Biology ,22, 134-167. 

Box, G.E.P. and Tiao, G. (1973). Bayesian Inference in Statistical Analysis. Lon¬ 
don: Addison-Wesley. 

Box, G.E.P. and Muller, M.E. (1958). A note on the generation of random normal 
deviates. A nn. Math. Statist., 29, 610-611. 

Brandwein, A.C. and Strawderman, W.E. (1990). Stein estimation: The spheri¬ 
cally symmetric case. Statistical Science, 5, 356-369. 

Breslow, N.E. (1984). Extra-Poisson variation in log-linear models. Appl. Statist., 
3,38-44. 

Breslow, N.E. (1990). Biostatistics and Hayes (with discussion). Statistical Sci¬ 
ence, 5, 269-298. 


© 2000 by CRC Press LLC 



Breslow, N.E. and Clayton, D.G. (1993). Approximate inference in generalized 
linear mixed models. J. Amer. Statist. Assoc., 88, 9-25. 

Broffitt, J.D. (1988). Increasing and increasing convex Bayesian graduation (with 
discussion). Trans. Soc. Actuaries, 40, 115-148. 

Brooks, S.P. and Gelman, A. (1998). General methods for monitoring convergence 
of iterative simulations. J. Comp. Graph. Statist., 7, 434-455. 

Brooks, S.P. and Roberts, G.O. (1998). Convergence assessment techniques for 
Markov chain Monte Carlo. Statistics and Computing, 8, 319-335. 

Brown, L.D. (2000). An essay on statistical decision theory. To appear, J. Atner. 
Statist. Assoc. 

Burden, R.L.. Faires, J.D.. and Reynolds, A.C. (1981). Numerical Analysis, 2nd 
ed. Boston: Prindle, Weber and Schmidt. 

Butler, S.M. and Louis, T.A. (1992). Random effects models with non-parametric 
priors. Statistics in Medicine, 11 , 1981-2000. 

Carlin, B.P. (1992). A simple Monte Carlo approach to Bayesian graduation. 
Trans. Soc. Actuaries, 44, 55-76. 

Carlin, B.P. (1996). Hierarchical longitudinal modeling. In Markov Chain Monte 
Carlo in Practice , eds. W.R. Gilks, S. Richardson, and D.J. Spiegelhalter, 
London: Chapman and Hall, pp. 303-319. 

Carlin, B.P., Chaloner, K., Church, T., Louis, T.A., and Matts, J.P. (1993). 
Bayesian approaches for monitoring clinical trials with an application to tox¬ 
oplasmic encephalitis prophylaxis. The Statistician, 42, 355-367. 

Carlin, B.P., Chaloner. K.M., Louis, T.A., and Rhame, F.S. (1995). Elicitation, 
monitoring, and analysis for an AIDS clinical trial (with discussion). In Case 
Studies in Bayesian Statistics, Volume II, eds. C. Gatsonis, J.S. Hodges, R.E. 
Kass and N.D. Singpurwalla, New York: Springer-Verlag, pp. 48-89. 

Carlin, B.P. and Chib, S. (1995). Bayesian model choice via Markov chain Monte 
Carlo methods. J. Roy. Statist. Soc., Ser. B, 57, 473-484. 

Carlin, B.P. and Gelfand, A. E. (1990). Approaches for empirical Bayes confidence 
intervals./ Amer. Statist. Assoc., 85, 105-114. 

Carlin, B.P. and Gelfand, A.E. (1991a). A sample reuse method for accurate 
parametric empirical Bayes confidence intervals. J. Roy. Statist. Soc., Ser. B. 
53, 189-200. 

Carlin, B.P. and Gelfand, A.E. (1991b). An iterative Monte Carlo method for 
nonconjugate Bayesian analysis. Statistics and Computing, 1, 119-128. 

Carlin, B.P., Kadane, J.B., and Gelfand. A.E. (1998). Approaches for optimal 
sequential decision analysis in clinical trials. Biometrics , 54, 964-975. 

Carlin, B.P., Kass, R.E., Lerch, F.J., and Huguenard, B.R. (1992). Predicting 
working memory failure: A subjective Bayesian approach to model selection. 

J. Amer. Statist. Assoc., 87, 319-327. 

Carlin, B.P. and Klugman, S.A. (1993). Hierarchical Bayesian Whittaker gradu¬ 
ation. Scandinavian Actuarial Journal, 1993.2 , 183-196. 

Carlin, B.P. and Louis, T.A. (1996). Identifying prior distributions that produce 
specific decisions, with application to monitoring clinical trials. In Bayesian 
Analysis in Statistics and Econometrics: Essays in Honor of Arnold Zellner, 
eds. D. Berry, K. Chaloner, and J. Geweke, New York: Wiley, pp. 493-503. 

Carlin, B.P. and Louis, T.A. (2000). Empirical Bayes: past, present and future. 


© 2000 by CRC Press LLC 



To appear, J. Amer. Statist. Assoc. 

Carlin, B.P. and Poison. N.G. (1991). Inference for nonconjugate Bayesian models 
using the Gibbs sampler. Canad. J. Statist., 19, 399-405. 

Carlin, B.P. and Poison. N.G. (1992). Monte Carlo Bayesian methods for discrete 
regression models and categorical time series. In Bayesian Statistics 4, J.M. 
Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith, eds., Oxford: Oxford 
University Press, pp. 577-586. 

Carlin, B.P., Poison, N.G., and Staffer, D.S. (1992). A Monte Carlo approach to 
nonnormal and nonlinear state-space modeling. J. Amer. Statist. Assoc., 87, 
493-500. 

Carlin, B.P. and Sargent, D.J. (1996). Robust Bayesian approaches for clinical 
trial monitoring. Statistics in Medicine, 15. 1093-1106. 

Carlin, J.B. and Louis, T.A. (1985). Controlling error rates by using conditional 
expected power to select tumor sites. Proc. Biopharmaceutical Section of the 
Amer. Statist. Assoc., 11-18. 

Carter, C.K. and Kohn, R. (1994). On Gibbs sampling for state space models. 
Biometrika, 81, 541-553. 

Carter, C.K. and Kohn. R. (1996). Markov chain Monte Carlo in conditionally 
Gaussian state space models. Biometrika, 83,589-601. 

Casella, G. (1985). An introduction to empirical Bayes data analysis. The Amer¬ 
ican Statistician, 39, 83-87. 

Casella, G. and Berger, R.L. (1990). Statistical Inference. Pacific Grove, CA: 
Wadsworth & Brooks-Cole. 

Casella, G. and George, E. (1992). Explaining the Gibbs sampler. The American 
Statistician, 46, 167-174. 

Casella, G. and Hwang, J. (1983). Empirical Bayes confidence sets for the mean 
of a multivariate normal distribution. J. Amer. Statist. Assoc., 78, 688-698. 

Centers for Disease Control and Prevention, National Center for Health Statis¬ 
tics (1988). Public Use Data Tape Documentation Compressed Mortality File, 
1968-1985. Hyattsvile, Maryland: U.S. Department of Health and Human Ser¬ 
vices. 

Chaloner, K. (1996). The elicitation of prior distributions. In Bayesian Biostatis¬ 
tics, eds. D. Berry and D. Stangl, New York: Marcel Dekker, pp. 141-156. 

Chaloner, K., Church, T., Louis, T.A., and Matts, J.P. (1993). Graphical elicita¬ 
tion of a prior distribution for a clinical trial. The Statistician, 42, 341-353. 

Chen, D.-G., Carter, E.M., Hubert, J.J., and Kim, P.T. (1999). Empirical Bayes 
estimation for combinations of multivariate bioassays. Biometrics, 55, 1038- 
1043. 

Chen, M.-H. (1994). Importance-weighted marginal Bayesian posterior density 
estimation. J. Amer. Statist. Assoc., 89, 818-824. 

Chen, M.-H., Shao, Q.-M., and Ibrahim, J.G. (2000). Monte Carlo Methods in 
Bayesian Computation. New York: Springer-Verlag. 

Chib, S. (1995). Marginal likelihood from the Gibbs output. J. Amer. Statist. 
Assoc., 90, 1313-1321. 

Chib, S. (1996). Calculating posterior distributions and modal estimates in 
Markov mixture models. Journal of Econometrics , 75, 79-97. 

Chib, S. and Carlin. B.P. (1999). On MCMC sampling in hierarchical longitudinal 


© 2000 by CRC Press LLC 



models. Statistics and Computing, 9, 17-26. 

Chib, S. and Greenberg, E. (1995a). Understanding the Metropolis-Hastings al¬ 
gorithm. The American Statistician, 49,327-335. 

Chib, S. and Greenberg, E. (1995b). Hierarchical analysis of SUR models with ex¬ 
tensions to correlated serial errors and time-varying parameter models. Journal 
of Econometrics, 68, 339-360. 

Chib, S. and Greenberg, E. (1998). Analysis of multivariate probit models. 
Biometrika, 85, 347-361. 

Choi, S., Lagakos, S.W.. Schooley, R.T., and Volberding, P.A. (1993) CD4+ lym¬ 
phocytes are an incomplete surrogate marker for clinical progession in persons 
with asymptomatic HIV infection taking zidovudine. Ann. Internal Med., 118, 
674-680. 

Christiansen, C.L. and Morris, C.N. (1997). Hierarchical Poisson regression mod¬ 
eling. J. Amer. Statist. Assoc., 92, 618-632. 

Clayton. D.G. (1991). A Monte Carlo method for Bayesian inference in frailty 
models. Biometrics, 47, 467-485. 

Clayton, D.G. and Bernardinelli. L. (1992). Bayesian methods for mapping dis¬ 
ease risk. In Geographical and Environmental Epidemiology: Methods for Small- 
Area Studies, P. Elliott, J. Cuzick, D. English, and R. Stern, eds., Oxford: 
Oxford University Press. 

Clayton, D.G. and Kaldor, J.M. (1987). Empirical Bayes estimates of age- 
standardized relative risks for use in disease mapping. Biometrics. 43, 671-681. 

Clyde, M.. DeSimone, H., and Parmigiani, G. (1996). Prediction via orthogonal- 
ized model mixing. J. Amer. Statist. Assoc., 91, 1197-1208. 

Concorde Coordinating Committee (1994). Concorde: MRC/ANRS random¬ 
ized double-blind controlled trial of immediate and deferred zidovudine in 
symptom-free HIV infection. Lancet, 343,871-881. 

Conlon. E.M. and Louis, T.A. (1999). Addressing multiple goals in evaluating 
region-specific risk using Bayesian methods. In Disease Mapping and Risk As¬ 
sessment for Public Health, A. Lawson, A. Biggeri, D. Bohning, E. Lesaffre, 
J.-F. Viel, and R. Bertollini, eds.. New York: Wiley, pp. 31-47. 

Cooper, H. and Hedges, L. eds. (1994). The Handbook of Research Synthesis. New 
York: Russell Sage Foundation. 

Cornfield, J. (1966a). Sequential trials, sequential analysis and the likelihood 
principle. The American Statistician, 20. 18-23. 

Cornfield, J. (1966b). A Bayesian test of some classical hypotheses - with appli¬ 
cations to sequential clinical trials. J. Amer Statist. Assoc., 61, 577-594. 

Cornfield, J. (1969). The Bayesian outlook and its applications. Biometrics, 25, 
617-657. 

Cowles, M.K. (1994). Practical issues in Gibbs sampler implementation with ap¬ 
plication to Bayesian hierarchical modeling of clinical trial data. Unpublished 
Ph.D. disssertation. Division of Biostatistics, University of Minnesota. 

Cowles, M.K. and Carlin. B.P. (1996). Markov chain Monte Carlo convergence 
diagnostics: A comparative review. J. Amer Statist. Assoc., 91, 883-904. 

Cowles, M.K., Carlin, B.P., and Connett, J.E. (1996). Bayesian Tobit modeling 
of longitudinal ordinal clinical trial compliance data with nonignorable rniss- 
ingness. J. Amer Statist. Assoc., 91, 86-98. 


© 2000 by CRC Press LLC 



Cowles, M.K. and Rosenthal, J.S. (1998). A simulation approach to convergence 
rates for Markov chain Monte Carlo algorithms. Statistics and Computing, 8, 
115-124. 

Cox, D.R. and Hinkley, D.V. (1974). Theoretical Statistics. London: Chapman 
and Hall. 

Cox, D.R. and Oakes, D. (1984 ). Analysis of Survival Data. London: Chapman 
and Hall. 

Creasy, M.A. (1954). Limits for the ratio of means. J. Roy. Statist. Soc., Ser. B. 
16, 186-194. 

Cressie, N. (1989). Empirical Bayes estimates of undercount in the decennial 
census. J. Amer. Statist. Assoc.. 84, 1033-1044. 

Cressie, N.A.C. (1993). Statistics for Spatial Data, revised ed. New York: Wiley. 

Cressie, N. and Chan, N.H. (1989). Spatial modeling of regional variables. J. 
Amer. Statist. Assoc., 84. 393-401. 

Damien, P., Laud, P.W., and Smith, A.F.M. (1995). Approximate random variate 
generation from infinitely divisible distributions with applications to Bayesian 
inference. J. Roy. Statist. Soc., Ser. B, 57, 547-563. 

Damien, P., Wakefield. J., and Walker, S. (1999). Gibbs sampling for Bayesian 
non-conjugate and hierarchical models by using auxiliary variables. J. Roy. 
Statist. Soc., Ser. B, 61, 331-344. 

Deely, J.J. and Lindley, D.V. (1981). Bayes empirical Bayes. J. Amer. Statist. 
Assoc., 76, 833-841. 

DeGroot, M.H. (1970). Optimal Statistical Decisions. New York: McGraw-Hill. 

Dellaportas, P. and Smith, A.F.M. (1993). Bayesian inference for generalised 
linear and proportional hazards models via Gibbs sampling. J. Roy. Statist. 
Assoc., Ser. C (Applied Statistics), 42, 443-459. 

Dellaportas, P., Forster, J.J., and Ntzoufras, 1. (1998). On Bayesian model and 
variable selection using MCMC. Technical report. Department of Statistics, 
Athens University of Economics and Business. 

Dempster, A.P.. Laird, N.M., and Rubin, D.B. (1977). Maximum likelihood esti¬ 
mation from incomplete data via the EM algorithm (with discussion). J. Roy. 
Statist. Soc., Ser. B, 39, 1-38. 

Dempster, A.P., Selwyn, M.R.. and Weeks. B.J. (1983). Combining historical 
and randomized controls for assessing trends in proportions. J. Amer. Statist. 
Assoc., 78, 221-227. 

DerSimonian, R. and Laird. N.M. (1986). Meta-analysis in clinical trials. Con¬ 
trolled Clinical Trials, 7, 177-188. 

DeSantis, F. and Spezzaferri, F. (1999). Methods for default and robust Bayesian 
model comparison: the fractional Bayes factor approach. International Statis¬ 
tical Review, 67, 267-286. 

DeSouza, C.M. (1991). An empirical Bayes formulation of cohort models in cancer 
epidemiology. Statistics in Medicine, 10, 1241-1256. 

Devine, O.J. (1992). Empirical Bayes and constrained empirical Bayes methods 
for estimating incidence rates in spatially aligned areas. Unpublished Ph.D. 
dissertation. Division of Biostatistics, Emory University. 

Devine, O.J., Halloran, M.E., and Louis, T.A. (1994). Empirical Bayes methods 
for stabilizing incidence rates prior to mapping. Epidemiology, 5, 622-630. 


© 2000 by CRC Press LLC 



Devine, O.J. and Louis, T.A. (1994). A constrained empirical Bayes estimator 
for incidence rates in areas with small populations. Statistics in Medicine, 13, 
1119-1133. 

Devine, O.J., Louis, T.A., and Halloran, M.E. (1994). Empirical Bayes estimators 
for spatially correlated incidence rates. Environmetrics, 5, 381-398. 

Devine, O.J., Louis, T.A., and Halloran, M.E. (1996). Identifying areas with ele¬ 
vated incidence rates using empirical Bayes estimators. Geographical Analysis, 
28, 187-199. 

Devroye, L. (1986). Non-Uniform Random Variate Generation. New York: 
Springer-V erlag. 

Dey, D.K. and Berger, J.O. (1983). On truncation of shrinkage estimators in 
simultaneous estimation of normal means. J. Amer. Statist. Assoc., 78, 865- 
869. 

Diebolt, J. and Robert, C.P. (1994). Estimation of finite mixture distributions 
through Bayesian sampling. J. Roy. Statist. Soc., Ser. B, 56, 363-375. 

Diggle, P.J. (1983). Statistical Analysis of Spatial Point Patterns. London: Aca¬ 
demic Press. 

Diggle, P.J., Tawn. J.A., and Moyeed, R.A. (1998). Model-based geostatistics 
(with discussion). /. Roy. Statist. Soc., Ser. C, 47, 299-350. 

Doss, H. and Narasimhan, B. (1994). Bayesian Poisson regression using the Gibbs 
sampler: Sensitivity analysis through dynamic graphics. Technical report. De¬ 
partment of Statistics, The Ohio State University. 

Draper. D. (1995). Assessment and propagation of model uncertainty (with dis¬ 
cussion). J. Roy. Statist. Soc., Ser. B, 57, 45-97. 

DuMouchel, W.H. (1988). A Bayesian model and a graphical elicitation procedure 
for multiple comparisons. In Bayesian Statistics 3, eds. J.M. Bernardo, M.H. 
DeGroot, D.V. Lindley, and A.F.M. Smith, Oxford: Oxford University Press, 
pp. 127-145. 

DuMouchel, w.H. and Harris, J.E. (1983). Bayes methods for combining the 
results of cancer studies in humans and other species (with discussion). J. 
Amer. Statist. Assoc., 78, 293-315. 

Eberly, L.E. and Carlin. B.P. (2000). Identifiability and convergence issues for 
Markov chain Monte Carlo fitting of spatial models. To appear. Statistics in 
Medicine. 

Ecker, M.D. and Gelfand, A.E. (1999). Bayesian modeling and inference for ge¬ 
ometrically anisotropic spatial data. Mathematical Geology, 31, 67-83. 

Edwards, R.G. and Sokal. A.D. (1988). Generalization of the Fortium-Kasteleyn- 
Swendsen-Wang representation and Monte Carlo algorithm. Physical Review 
D, 38, 2009-2012. 

Edwards. W., Lindman, H., and Savage, L.J. (1963). Bayesian statistical inference 
for psychological research. Psych. Rev., 70, 193-242. 

Efron, B. (1986). Why isn't everyone a Bayesian? (with discussion). Amer. Statis¬ 
tician, 40, 1-11. 

Efron, B. (1996). Empirical Bayes methods for combining likelihoods (with dis¬ 
cussion). J. Amer. Statist. Assoc., 91, 538-565. 

Efron, B. and Morris, C.N. (1971). Limiting the risk of Bayes and empirical Bayes 
estimators - Part 1: The Bayes case. J. Amer. Statist. Assoc., 66, 807-815. 


© 2000 by CRC Press LLC 



Efron. B. and Morris, C.N. (1972a). Limiting the risk of Bayes and empirical 
Bayes estimators - Part 11: The empirical Bayes case. J. Amer. Statist. Assoc., 
67, 130-139. 

Efron, B. and Morris, C.N. (1972b). Empirical Bayes on vector observations: An 
extension of Stein's method. Biomettika, 59, 335-347. 

Efron. B. and Morris, C. (1973a). Stein's estimation rule and its competitors - 
An empirical Bayes approach. J. Amer. Statist. Assoc., 68. 117-130. 

Efron, B. and Morris, C. (1973b). Combining possibly related estimation prob¬ 
lems (with discussion). J. Roy. Statist. Soc., Ser. B, 35, 379-421. 

Efron, B. and Morris, C. (1975). Data analysis using Stein's estimator and its 
generalizations. J. Amer. Statist. Assoc., 70. 311-319. 

Efron, B. and Morris, C. (1977). Stein's paradox in statistics. Scientific American, 
236, 119-127. 

Erkanli, A., Soyer, R., and Costello, E.J. (1999). Bayesian inference for prevalence 
in longitudinal two-phase studies. Biometrics, 55, 1145-1150. 

Escobar. M.D. (1994). Estimating normal means with a Dirichlet process prior. 

J. Amer. Statist. Assoc., 89, 268-277. 

Escobar, M.D. and West, M. (1992). Computing Bayesian nonparametric hierar¬ 
chical models. Discussion Paper 92-A20, Institute for Statistics and Decision 
Sciences, Duke University. 

Escobar. M.D. and West, M. (1995). Bayesian density estimation and inference 
using mixtures. J. Amer. Statist. Assoc., 90, 577-588. 

Etzioni, R. and Carlin, B.P. (1993). Bayesian analysis of the Ames 
Salmonella/microsome assay. In Case Studies in Bayesian Statistics (Lecture 
Notes in Statistics, Vol. 83), eds. C. Gatsonis, J.S. Hodges, R.E. Kass, and 
N.D. Singpurwalla, New York: Springer-Verlag, pp. 311-323. 

Evans, M. and Swartz, T. (1995). Methods for approximating integrals in statis¬ 
tics with special emphasis on Bayesian integration problems. Statistical Sci¬ 
ence. 10, 254-272; discussion and rejoinder: 11, 54-64. 

Fay R.E., and Herriot, R.A. (1979). Estimates of income for small places: An 
application of James-Stein procedures to census data. J. Amer. Statist. Assoc., 
74, 269-277. 

Ferguson. T.S. (1967). Mathematical Statistics: A Decision Theory Approach. 
New York: Academic Press. 

Ferguson, T.S. (1973). A Bayesian analysis of some nonparametric problems. 
Ann. Statist., 1, 209-230. 

Fieller, E.C. (1954). Some problems in interval estimation. J. Roy. Statist. Soc., 
Ser. B, 16, 175-185. 

Finney, D.J. (1947). The estimation from individual records of the relationship 
between dose and quantal response. Biomettika, 34, 320-334. 

Fleming, T.R. (1992). Evaluating therapeutic interventions: Some issues and ex¬ 
periences (with discussion). Statistical Science, 7, 428-456. 

Freedman, L.S. and Spiegelhalter, D.J. (1983). The assessment of subjective opin¬ 
ion and its use in relation to stopping rules for clinical trials. The Statistician, 
33, 153-160. 

Freedman, L.S. and Spiegelhalter, D.J. (1989). Comparison of Bayesian with 
group sequential methods for monitoring clinical trials. Controlled Clinical Tri- 


© 2000 by CRC Press LLC 



eds, 10, 357-367. 

Freedman, L.S. and Spiegelhalter, D.J. (1992). Application of Bayesian statistics 
to decision making during a clinical trial. Statistics in Medicine, 11, 23-35. 

Fuller, W.A. (1987). Measurement Etror Models. New York: Wiley. 

Gamerman, D. (1997). Markov Chain Monte Carlo: Stochastic Simulation for 
Bayesian Inference. London: Chapman and Hall. 

Gatsonis, C., Hodges, J.S., Kass, R.E., and Singpurwalla, N.D.. eds. (1993). Case 
Studies in Bayesian Statistics. New York: Springer-Verlag. 

Gatsonis, C., Hodges, J.S., Kass, R.E., and Singpurwalla. N.D.. eds. (1995). Case 
Studies in Bayesian Statistics, Volume II. New York: Springer-Verlag. 

Gatsonis, C., Hodges, J.S., Kass, R.E., McCulloch, R.E., Rossi, P., and Singpur¬ 
walla, N.D.. eds. (1997). Case Studies in Bayesian Statistics, Volume III. New 
York: Springer-Verlag. 

Gatsonis, C., Kass, R.E., Carlin, B.P.. Carriquiry. A.L., Gelman, A.. Verdinelli, 
I., and West, M., eds. (1999). Case Studies in Bayesian Statistics, Volume IV. 
New York: Springer-Verlag. 

Gatsonis, C., Kass, R.E., Carlin, B.P.. Carriquiry. A.L., Gelman, A.. Verdinelli, 

I. , and West, M., eds. (2000). Case Studies in Bayesian Statistics, Volume V. 
New York: Springer-Verlag. 

Gaver, D.P. and O'Muircheartaigh, I.G. (1987). Robust empirical Bayes analyses 
of event rates. Technometrics, 29, 1-15. 

Geisser, S. (1993). Predictive Inference. London: Chapman and Hall. 

Gelfand, A.E. and Carlin, B.P. (1993). Maximum likelihood estimation for con¬ 
strained or missing data models. Canad. J. Statist ., 21, 303-311. 

Gelfand, A.E., Carlin, B.P., and Trevisani, M. (2000). On computation using 
Gibbs sampling for multilevel models. Technical report. Department of Statis¬ 
tics, University of Connecticut. 

Gelfand, A.E. and Dalai, S.R. (1990). A note on overdispersed exponential fam¬ 
ilies. Biometrika, 77, 55-64. 

Gelfand, A.E. and Dey, D.K. (1994). Bayesian model choice: asymptotics and 
exact calculations. J. Roy. Statist. Soc., Ser. B, 56, 501-514. 

Gelfand, A.E., Dey, D.K., and Chang, H. (1992). Model determination using pre¬ 
dictive distributions with implementation via sampling-based methods (with 
discussion). In Bayesian Statistics 4, J.M. Bernardo, J.O. Berger, A.P. Dawid, 
and A.F.M. Smith, eds., Oxford: Oxford University Press, pp. 147-167. 

Gelfand, A.E. and Ghosh, S.K. (1998). Model choice: a minimum posterior pre¬ 
dictive loss approach. Biometrika, 85, 1-11. 

Gelfand. A.E.. Hills, S.E.. Racine-Poon. A., and Smith, A.F.M. (1990). Illustra¬ 
tion of Bayesian inference in normal data models using Gibbs sampling. J. 
Amer. Statist. Assoc., 85, 972-985. 

Gelfand, A.E., and Lee. T.-M. (1993). Discussion of the meeting on the Gibbs 
sampler and other Markov chain Monte Carlo methods. J. Roy. Statist. Soc., 
Ser. B, 55, 72-73. 

Gelfand. A.E. and Sahu, S.K. (1994). On Markov chain Monte Carlo acceleration. 

J. Comp. Graph. Statist., 3, 261-276. 

Gelfand. A.E. and Sahu. S.K. (1999). Identifiability, improper priors and Gibbs 
sampling for generalized linear models. J. Amer. Statist. Assoc., 94, 247-253. 


© 2000 by CRC Press LLC 



Gelfand, A.E., Sahu, S.K., and Carlin, B.P. (1995). Efficient parametrizations for 
normal linear mixed models. Biometrika, 82 , 479-488. 

Gelfand, A.E., Sahu, S.K., and Carlin, B.P. (1996). Efficient parametrizations 
for generalized linear mixed models (with discussion). In Bayesian Statistics 

5, eds. J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith. Oxford: 
Oxford University Press, pp. 165-180. 

Gelfand, A.E. and Smith, A.F.M. (1990). Sampling-based approaches to calcu¬ 
lating marginal densities. J. Ainer. Statist. Assoc., 85, 398-409. 

Gelfand, A.E., Smith. A.F.M.. and Lee, T-M. (1992). Bayesian analysis of con¬ 
strained parameter and truncated data problems using Gibbs sampling. J. 
Amer. Statist. Assoc., 87 , 523-532. 

Gelman, A. (1996). Bayesian model-building by pure thought: some principles 
and examples. Statistica Sinicci, 6, 215-232. 

Gelman, A., Carlin. J., Stern, H., and Rubin, D.B. (1995). Bayesian Data Anal¬ 
ysis. London: Chapman and Hall. 

Gelman, A., Meng, X.-L., and Stern, H.S. (1996). Posterior predictive assessment 
of model fitness via realized discrepancies (with discussion). Statistica Sinica, 

6, 733-807. 

Gelman. A., Roberts, G.O., and Gilks, W.R. (1996). Efficient Metropolis jumping 
rules. In Bayesian Statistics 5, J.M. Bernardo, J.O. Berger, A.P. Dawid, and 
A.F.M. Smith, eds., Oxford: Oxford University Press, pp. 599-607. 

Gelman. A. and Rubin, D.B. (1992). Inference from iterative simulation using 
multiple sequences (with discussion). Statistical Science, 7 . 457-511. 

Geman. S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions and 
the Bayesian restoration of images. IEEE Trans, on Pattern Analysis and Ma¬ 
chine Intelligence, 6, 721-741. 

Gentle. J.E. (1998). Random Number Generation and Monte Carlo Methods. New 
York: Springer-Verlag. 

Genz. A. and Kass, R. (1997). Subregion adaptive integration of functions having 
a dominant peak. J. Comp. Graph. Statist., 6, 92- 111. 

George. E.I. and McCulloch, R.E. (1993). Variable selection via Gibbs sampling. 
J. Ainer. Statist. Assoc., 88. 881-889. 

Geweke. J. (1989). Bayesian inference in econometric models using Monte Carlo 
integration. Econometrica, 57, 1317-1339. 

Geweke. J. (1992). Evaluating the accuracy of sampling-based approaches to the 
calculation of posterior moments (with discussion). In Bayesian Statistics 4, 
J.M. Bernardo. J.O. Berger, A.P. Dawid, and A.F.M. Smith, eds., Oxford: 
Oxford University Press, pp. 169-193. 

Geyer, C.J. (1992). Practical Markov Chain Monte Carlo (with discussion). Sta¬ 
tistical Science. 7 . 473-511. 

Geyer, C.J. (1993). Estimating normalizing constants and reweighting mixtures 
in Markov chain Monte Carlo. Technical report No. 568, School of Statistics, 
University of Minnesota. 

Geyer, C.J. and Thompson. E.A. (1992). Constrained Monte Carlo maximum 
likelihood for dependent data (with discussion). J. Roy. Statist. Soc., Ser. B, 
54 . 657-699. 

Geyer, C.J. and Thompson. E.A. (1995). Annealing Markov chain Monte Carlo 


© 2000 by CRC Press LLC 



with applications to ancestral inference. J. Amer. Statist. Assoc., 90, 909-920. 

Ghosh, M. (1992). Constrained Bayes estimation with applications. J. Amer. 
Statist. Assoc., 87, 533-540. 

Ghosh, M. and Meeden, G. (1986). Empirical Bayes estimation in finite popula¬ 
tion sampling. J. Amer. Statist. Assoc., 81, 1058-1062. 

Ghosh, M. and Rao, J.N.K. (1994). Small area estimation: An appraisal (with 
discussion). Statistical Science, 9, 55-93. 

Gilks, W.R. (1992). Derivative-free adaptive rejection sampling for Gibbs sam¬ 
pling. In Bayesian Statistics 4, J.M. Bernardo, J.O. Berger, A.P. Dawid, and 
A.F.M. Smith, eds., Oxford: Oxford University Press, pp. 641-649. 

Gilks, W.R., Clayton, D.G., Spiegelhalter, D.J., Best, N.G., McNeil, A.J., 
Sharpies, L.D., and Kirby, A.J. (1993). Modelling complexity: Applications 
of Gibbs sampling in medicine. J. Roy. Statist. Soc., Ser. B, 55, 38-52. 

Gilks, W.R., Richardson, S., and Spiegelhalter, D.J., eds. (1996). Markov Chain 
Monte Carlo in Practice. London: Chapman and Hall. 

Gilks, W.R. and Roberts, G.O. (1996). Strategies for improving MCMC. In 
Markov Chain Monte Carlo in Practice, W.R. Gilks, S. Richardson, and D.J. 
Spiegelhalter, D.J., eds, London: Chapman and Hall, pp. 89-114. 

Gilks, W.R., Roberts, G.O., and Sahu, S.K. (1998). Adaptive Markov chain 
Monte Carlo through regeneration. J. Amer. Stat. Assoc., 93, 1045-1054. 

Gilks, W.R., Thomas, A., and Spiegelhalter, D.J. (1992) Software for the Gibbs 
sampler. Computing Science and Statistics, 24, 439-448. 

Gilks, W.R., Wang, C.C., Yvonnet, B., and Coursaget, P. (1993). Random-effects 
models for longitudinal data using Gibbs sampling. Biometrics, 49, 441-453. 

Gilks, W.R. and Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling. 

J. Roy. Statist. Soc., Ser. C (Applied Statistics), 41, 337-348. 

Godsill. S.J. (2000). On the relationship between MCMC model uncertainty 
methods. Technical Report CUED /F-INFENG/TR.305. Engineering Depart¬ 
ment, Cambridge University. 

Goel, P.K. (1988). Software for Bayesian analysis: Current status and additional 
needs (with discussion). In Bayesian Statistics 3, J. M. Bernardo, M.H. DeG- 
root, D.V. Lindley, and A.F.M. Smith, eds., Oxford: Oxford University Press, 
pp. 173-188. 

Goldman, A.I., Carlin, B.P., Crane, L.R., Launer, C., Korvick, J.A., Deyton, 
L., and Abrams, D.I. (1996). Response of CD4+ and clinical consequences 
to treatment using ddl or ddC in patients with advanced HIV infection. J. 
Acquired Immune Deficiency Syndromes and Human Retrovirology, 11, 161- 

169. 

Goldstein. H. (1995). Kendall's Library of Statistics 3: Multilevel Statistical Mod¬ 
els, 2nd ed. London: Arnold. 

Goldstein, H. and Spiegelhalter, D.J. (1996). League tables and their limitations: 
statistical issues in comparisons of institutional performance (with discussion). 

J. Roy. Statist. Soc., Ser. A, 159, 385-443. 

Goldstein, M. (1981). Revising previsions: A geometric interpretation. J. Roy. 
Statist. Soc., Ser. B, 43, 105-130. 

Good, I.J. (1965). The Estimation of Probabilities. Cambdrige. MA: MIT Press. 

Grambsch, P. and Therneau, T.M. (1994). Proportional hazards tests and diag- 


© 2000 by CRC Press LLC 



nostics based on weighted residuals. Biometrika, 81,515-526. 

Green. P.J. (1995). Reversible jump Markov chain Monte Carlo computation and 
Bayesian model determination. Biometrika, 82,711-732. 

Green. P.J. and Murdoch, D.J. (1999). Exact sampling for Bayesian inference: 
towards general purpose algorithms (with discussion). In Bayesian Statistics 
6, J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith, eds., Oxford: 
Oxford University Press, pp. 301-321. 

Green, P.J. and O'Hagan, A. (1998). Carlin and Chib do not need to sample 
from pseudopriors. Research Report 98-1, Department of Statistics, University 
of Nottingham. 

Greenhouse, J.B. and Wasserman, L.A. (1995). Robust Bayesian methods for 
monitoring clinical trials. Statistics in Medicine, 14, 1379-1391. 

Gustafson, P. (1994). Hierarchical Bayesian analysis of clustered survival data. 
Technical Report No. 144, Department of Statistics, University of British 
Columbia. 

Guttman, I. (1982). Linear Models: An Introduction. New York: Wiley. 

Hammersley, J.M. and Handscomb. D.C. (1964). Monte Carlo Methods. London: 
Methuen. 

Han, C. and Carlin. B.P. (2000). MCMC methods for computing Bayes factors: A 
comparative review. Research Report 2000-01, Division of Biostatistics, Uni¬ 
versity of Minnesota, 2000. 

Handcock, M.S. and Stein, M.L. (1993). A Bayesian analysis of kriging. Techno¬ 
metrics, 35, 403-410. 

Handcock, M.S. and Wallis, J.R. (1994). An approach to statistical spatial- 
temporal modeling of meteorological fields (with discussion). J. Amer. Statist. 
Assoc.. 89, 368-390. 

Hardgan, J.A. (1983). Bayes Theoty. New York: Springer-Verlag. 

Hastings, W.K. (1970). Monte Carlo sampling methods using Markov chains and 
their applications. Biometrika, 57, 97-109. 

Heidelberger. P. and Welch, P.D. (1983). Simulation run length control in the 
presence of an initial transient. Operations Research, 31, 1109-1144. 

Higdon. D.M. (1998). Auxiliary variable methods for Markov chain Monte Carlo 
with applications. J. Amer. Statist. Assoc., 93, 585-595. 

Hill, J.R. (1990). A general framework for model-based statistics. Biometrika, 
77, 115-126. 

Hills, S.E. and Smith, A.F.M. (1992). Parametrization issues in Bayesian infer¬ 
ence. In Bayesian Statistics 4, eds. J.M. Bernardo, J.O. Berger, A.P. Dawid, 
and A.F.M. Smith, pp. 641-649, Oxford: Oxford University Press. 

Hobert, J.P. and Casella, G. (1996). The effect of improper priors on Gibbs 
sampling in hierarchical linear mixed models. J. Amer. Statist. Assoc., 91, 
1461-1473. 

Hodges, J.S. (1998). Some algebra and geometry for hierarchical models, applied 
to diagnostics (with discussion). J. Roy. Statist. Soc., Series B, 60. 497-536. 

Hodges, J.S. and Sargent, D.J. (1998). Counting degrees of freedom in hierar¬ 
chical and other richly- par ameterised models. Technical report. Division of 
Biostatistics, University of Minnesota. 

Hogg, R.V. and Craig, A.T. (1978). Introduction to Mathematical Statistics, 4th 


© 2000 by CRC Press LLC 



ed. New York: Macmillan. 

Hui, S.L. and Berger, J.O. (1983). Empirical Bayes estimation of rates in longi¬ 
tudinal studies. J. Amer. Statist. Assoc., 78 , 753-759. 

Hurn, M. (1997). Difficulties in the use of auxiliary variables in Markov chain 
Monte carlo methods. Statistics and Computing, 7, 35-44. 

Jacobson, M.A., Besch, C.L., Child, C., Hafner, R., Matts, J.P., Muth, K., Went¬ 
worth, D.N.. Neaton, J.D., Abrams, D., Rimland, D., Perez, G., Grant, I.H., 
Saravolatz, L.D., Brown, L.S., Deyton. L., and the Terry Beirn Community 
Programs for Clinical Research on AIDS (1994). Primary prophylaxis with 
pyrimethamine for toxoplasmic encephalitis in patients with advanced human 
immunodeficiency virus disease: Results of a randomized trial. J. Infectious 
Diseases, 169, 384-394. 

James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proc. Fourth 
Berkeley Symp. on Math. Statist, and Prob., 1. Berkeley, CA: Univ. of Cali¬ 
fornia Press, pp. 361-379. 

Jeffreys, H. (1961). Theory of Probability, 3rd ed. Oxford: University Press. 

Johnson, V.E. (1998). A coupling-regeneration scheme for diagnosing convergence 
in Markov chain Monte Carlo algorithms. J. Amer. Statist. Assoc., 93 , 238- 
248. 

Kackar, R.N. and Harville, D.A. (1984). Approximations for standard errors of 
estimators of fixed and random effects in mixed linear models. J. Amer. Statist. 
Assoc., 19, 853-862. 

Kadane, J.B. (1986). Progress toward a more ethical method for clinical trials. 

J. of Medical Philosophy, 11, 385-404. 

Kadane, J.B., Chan, N.H., and Wolfson. L.J. (1996). Priors for unit root models. 

J. of Econometrics, 75, 99-111. 

Kadane, J.B., Dickey, J.M., Winkler, R.L., Smith, W.S., and Peters. S.C. (1980). 
Interactive elicitation of opinion for a normal linear model. J. Amer. Statist. 
Assoc., 75 , 845-854. 

Kadane, J.B. and Wolfson, L.J. (1996). Priors for the design and analysis of 
clinical trials. In Bayesian Biostatistics, eds. D. Berry and D. Stangl, New 
York: Marcel Dekker, pp. 157-184. 

Kalbfleisch, J.D. (1978). Nonparametric Bayesian analysis of survival time data. 

J. Roy. Statist. Soc., SerB, 40 . 214-221. 

Kaluzny, S.P., Vega, S.C., Cardoso, T.P., and Shelly, A.A. (1998). S+SpatialStats: 
User's Manual for Windows and UNIX. New York: Springer-Verlag. 

Kass, R.E., Carlin, B.P., Gelman, A., and Neal, R. (1998). Markov chain Monte 
Carlo in practice: A roundtable discussion. The American Statistician, 52 , 
93-100. 

Kass, R.E. and Raftery, A.E. (1995). Bayes factors. J. Amer. Statist. Assoc., 90 , 
773-795. 

Kass, R.E. and Steffey, D. (1989). Approximate Bayesian inference in condition¬ 
ally independent hierarchical models (parametric empirical Bayes models). J. 
Amer. Statist. Assoc., 84 , 717-726. 

Kass, R.E., Tierney, L., and Kadane, J. (1988). Asymptotics in Bayesian compu¬ 
tation (with discussion). In Bayesian Statistics 3, J.M. Bernardo, M.H. DeG- 
root, A.P. Dawid, and A.F.M. Smith, eds., Oxford: Oxford University Press, 


© 2000 by CRC Press LLC 



pp. 261-278. 

Kass, R.E., Tierney, L., and Kadane, J. (1989). Approximate methods for assess¬ 
ing influence and sensitivity in Bayesian analysis. Biometrika, 76. 663-674. 

Kass, R.E. and Vaidyanathan. S. (1992). Approximate Bayes factors and orthog¬ 
onal parameters, with application to testing equality of two binomial propor¬ 
tions. J. Roy. Statist. Soc. B. 54, 129-144. 

Kass, R.E. and Wasserman, L. (1996). The selection of prior distributions by 
formal rules. J. Amer. Statist. Assoc., 91, 1343-1370. 

Killough, G.G., Case, MA., Meyer. K.R., Moore, R.E., Rope, S.K., Schmidt, 
D.W., Schleien, B„ Sinclair, W.K., Voilleque, P.G., and Till, J.E. (1996). Task 
6: Radiation doses and risk to residents from FMPC operations from 1951 
1988. Draft report. Radiological Assessments Corporation, Neeses, SC. 

Kolassa, J.E. and Tanner, M.A. (1994). Approximate conditional inference in 
exponential families via the Gibbs sampler. J. Amer. Statist. Assoc., 89, 697- 
702. 

Knorr-Held, L. and Rasser, G. (2000). Bayesian detection of clusters and discon¬ 
tinuities in disease maps. Biometrics, 56, 13-21. 

Kreft, I.G.G., de Leeuw, J., and van der Leeden, R. (1994). Review of five multi¬ 
level analysis programs: BMDP-5V, GENMOD, HLM, ML3, and VARCL. The 
American Statistician, 48, 324-335. 

Krewski. D., Leroux, B.G., Bleuer, S.R., and Broekhoven, L.H. (1993). Modeling 
the Ames salmonella/microsome Assay. Biometrics, 49, 499-510. 

Krige, D.G. (1951). A statistical approach to some basic mine valuation problems 
on the Witwatersrand. J. Chemical, Metallurgical and Mining Society of South 
Africa, 52.119-139. 

Laird, N.M. (1978). Nonparametric maximum likelihood estimation of a mixing 
distribution. J. Amer. Statist. Assoc., 73, 805-811. 

Laird, N.M., Lange. N., and Stram. D. (1987). Maximum likelihood computations 
with repeated measures: application of the EM algorithm. J. Amer. Statist. 
Assoc., 82, 97-105. 

Laird, N.M. and Louis, T.A. (1982). Approximate posterior distributions for in¬ 
complete data problems. J. Roy. Statist. Soc., Ser. B. 44, 190-200. 

Laird, N.M. and Louis, T.A. (1987). Empirical Bayes confidence intervals based 
on bootstrap samples (with discussion). J. Amer. Statist. Assoc.. 82, 739-757. 

Laird. N.M. and Louis, T.A. (1989). Empirical Bayes confidence intervals for a 
series of related experiments. Biometrics, 45, 481-495. 

Laird, N.M. and Louis, T.A. (1989). Empirical Bayes ranking methods. J. Edu¬ 
cational Statistics, 14, 29-46. 

Laird, N.M. and Louis, T.A. (1991). Smoothing the non-parameterc estimate 
of a prior distribution by roughening: A computational study. Computational 
Statistics and Data Analysis, 12, 27-37. 

Laird, N.M. and Ware, J.H. (1982). Random-effects models for longitudinal data. 
Biometrics, 38, 963-974. 

Lan, K.K.G. and DeMets, D.L. (1983). Discrete sequential boundaries for clinical 
trials. Biometrika, 70, 659-663. 

Lange, N., Carlin, B.P., and Gelfand, A.E. (1992). Hierarchical Bayes models for 
the progression of HIV infection using longitudinal CD4 T-cell numbers (with 


© 2000 by CRC Press LLC 



discussion). J. Amer. Statist. Assoc., 87, 615-632. 

Lange, N. and Ryan, L. (1989). Assessing normality in random effects models. 
Annals of Statistics, 17, 624-642. 

Laplace, P.S. (1986). Memoir on the probability of the causes of events (English 
translation of the 1774 French original by S.M. Stigler). Statistical Science, 1, 
364-378. 

Laud, P. and Ibrahim, J. (1995). Predictive model selection. J. Roy. Statist. Soc., 
Ser. B, 57, 247-262. 

Lauritzen, S.L. and Spiegelhalter, D.J. (1988). Local computations with proba¬ 
bilities on graphical structures and their application to expert systems (with 
discussion). /. Roy. Statist. Soc., Ser. B. 50. 157-224. 

Lavine. J. and Schervish, M.J. (1999). Bayes factors: what they are and what 
they are not. The American Statistician, 53, 119-123. 

Le, N.D., Sun. W.. and Zidek, J.V. (1997). Bayesian multivariate spatial interpo¬ 
lation with data missing by design. J. Roy. Statist. Soc., Ser. B, 59, 501-510. 

Lee, P.M. (1997). Bayesian Statistics: An Introduction, 2nd ed. London: Arnold. 

Leonard, T. and Hsu, J.S.J. (1999). Bayesian Methods. Cambridge: Cambridge 
University Press. 

Liang, K.-Y. and Waclawiw, M.A. (1990). Extension of the Stein estimating pro¬ 
cedure through the use of estimating functions. J. Amer. Statist. Assoc., 85, 
435-440. 

Lin. D.Y., Fischl, M.A., and Schoenfeld. D.A. (1993). Evaluating the role of CD4- 
lymphocyte counts as surrogate endpoints in human immunodeficiency virus 
clinical trials. Statist, in Med., 12, 835-842. 

Lin, D.Y. and Wei, L.J. (1989). The robust inference for the Cox proportional 
hazards model. J. Amer. Statist. Assoc., 84, 1074-1078. 

Lindley, D.V. (1972). Bayesian statistics: A review. Philadelphia: SIAM. 

Lindley, D.V. (1983). Discussion of "Parametric empirical Bayes inference: The¬ 
ory and Applications," by C.N. Morris. J. Amer. Statist. Assoc., 78, 61-62. 

Lindley, D.V. (1990). The 1988 Wald memorial lectures: The present position in 
Bayesian statistics (with discussion). Statistical Science, 5, 44-89. 

Lindley, D.V. and Phillips. L.D. (1976). Inference for a Bernoulli process (a 
Bayesian view). Amer. Statist., 30, 112-119. 

Lindley, D.V. and Smith, A.F.M. (1972). Bayes estimates for the linear model 
(with discussion). J. Roy. Statist. Soc., Ser. B, 34, 1-41. 

Lindsay, B.G. (1983). The geometry of mixture likelihoods: a general theory. Ann. 
Statist., 11, 86-94. 

Lindstrom, M.J. and Bates, D.M. (1990). Nonlinear mixed effects models for 
repeated measures data. Biometrics, 46, 673-687. 

Liseo, B. (1993). Elimination of nuisance parameters with reference priors. 
Biometrika, 80, 295-304. 

Little, R.J.A. and Rubin, D.B. (1987). Statistical Analysis with Missing Data. 
New York: Wiley. 

Liu, C.H., Rubin, D.B., and Wu, Y.N. (1998). Parameter expansion to accelerate 
EM: the PX-EM algorithm. Biometrika, 85, 755-770. 

Liu, J.S. (1994). The collapsed Gibbs sampler in Bayesian computations with 
applications to a gene regulation problem. Journal of the American Statistical 


© 2000 by CRC Press LLC 



Association, 89, 958-966. 

Liu, J.S., Wong, W.H., and Kong, A. (1994). Covariance structure of the Gibbs 
sampler with applications to the comparisons of estimators and augmentation 
schemes. Biometrika, 81, 27-40. 

Liu, J.S. and Wu, Y.N. (1999). Parameter expansion for data augmentation. J. 
Amer. Statist. Assoc., 94, 1264-1274. 

Longmate, J.A. (1990). Stochastic approximation in bootstrap and Bayesian ap¬ 
proaches to interval estimation in hierarchical models. In Computing Science 
and Statistics: Proc. 22nd Symposium on the Interface, New York: Springer- 
Verlag, pp. 446-450. 

Louis, T.A. (1981a). Confidence intervals for a binomial parameter after observing 
no successes. The American Statistician, 35, 151-154. 

Louis, T.A. (1981b). Nonparametric analysis of an accelerated failure time model. 
Biometrika, 68. 381-390. 

Louis, T.A. (1982). Finding the observed information matrix when using the EM 
algorithm. J. Roy. Statist. Soc., Ser. B, 44. 226-233. 

Louis, T.A. (1984). Estimating a population of parameter values using Bayes and 
empirical Bayes methods. J. Amer. Statist. Assoc., 79, 393-398. 

Louis, T.A. (1991). Using empirical Bayes methods in biopharmaceutical research 
(with discussion). Statistics in Medicine, 10, 811-829. 

Louis, T.A. and Bailey, J.K. (1990). Controlling error rates using prior infor¬ 
mation and marginal totals to select tumor sites. J. Statistical Planning and 
Inference, 24, 297-316. 

Louis, T.A. and DerSimonian, R. (1982). Health statistics based on discrete 
population groups. In Regional Variations in Hospital Use, D. Rothberg, ed., 
Boston: D.C. Heath & Co. 

Louis, T.A. and Shen, W. (1999). Innovations in Bayes and empirical Bayes 
methods: estimating parameters, populations and ranks. Statistics in Medicine, 
18. 2493-2505. 

MacEachern, S.N. and Berliner, L.M. (1994). Subsampling the Gibbs sampler. 
The American Statistician, 48, 188-190. 

MacEachern, S.N. and Muller, p. (1994). Estimating mixture of Dirichlet process 
models. Discussion Paper 94-11, Institute for Statistics and Decision Sciences, 
Duke University. 

MacEachern, S.N. and Peruggia, M. (2000a). Subsampling the Gibbs sampler: 
variance reduction. Statisitcs and Probability Letters, 47. 91-98. 

MacEachern, S.N. and Peruggia, M. (2000b). Importance link function estimation 
for Markov chain Monte Carlo methods. To appear. Journal of Computational 
and Graphical Statistics. 

MacGibbon, B. and Tomberlin, T.J. (1989). Small area estimates of proportions 
via empirical Bayes techniques. Survey Methodology, 15, 237-252. 

Madigan, D. and Raftery, A.E. (1994). Model selection and accounting for model 
uncertainty in graphical models using Occam's window. J. Amer. Statist. A s- 
soc., 89, 1535-1546. 

Madigan, D. and York, J. (1995). Bayesian graphical models for discrete data. 
International Statistical Review, 63, 215-232. 

Manton, K.E., Woodbury, M.A., Stallard, E., Riggan, W.B., Creason, J.P., and 


© 2000 by CRC Press LLC 



Pellom, A.C. (1989). Empirical Bayes procedures for stabilizing maps of U.S. 
cancer mortality rates. J. Amer. Statist. Assoc., 84, 637-650. 

Mann, P., Louis, T.A., Lane, T.J.. Gottlieb, L.. Engel, P., and Rippey, R.M. 
(1988). Unfavorable outcomes of drug therapy: Subjective probability versus 
confidence intervals. J. Clin. Phatm. and Therapeutics, 13, 213-217. 

Maritz, J.S. and Lwin, T. (1989). Empirical Bayes Methods, 2nd ed. London: 
Chapman and Hall. 

Marriott, J., Ravishanker. N., Gelfand, A., and Pai, J. (1996). Bayesian analysis 
of ARMA processes: Complete sampling-based inference under exact likeli¬ 
hoods. In Bayesian Analysis in Statistics and Econometrics: Essays in Honor 
of Arnold Zellner, eds. D. Berry, K. Chaloner, and J. Geweke, New York: Wiley, 
pp. 243-256. 

McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models, 2nd ed. Lon¬ 
don: Chapman and Hall. 

McNeil, A.J. and Gore, S.M. (1996). Statistical analysis of zidovudine (AZT) 
effect on CD4 cell counts in HIV disease. Statistics in Medicine, 15, 75-92. 

Meeden, G. (1972). Some admissible empirical Bayes procedures. Ann. Math. 
Statist., 43, 96-101. 

Meilijson, I. (1989). A fast improvement of the EM algorithm on its own terms. 

J. Roy. Statist. Soc., Ser B, 51, 127-138. 

Meng, X.-L. (1994). Posterior predictive p-values. Ann. Statist., 22, 1142-1160. 

Meng, X.-L. and Rubin, D.B. (1991). Using EM to obtain asymptotic variance- 
covariance matrices: The SEM algorithm. J. Amer. Statist. Assoc., 86, 899- 
909. 

Meng, X.-L. and Rubin. D.B. (1992). Recent extensions to the EM algorithm 
(with discussion). In Bayesian Statistics 4, I. M. Bernardo, J.O. Berger, A.P. 
Dawid, and A.F.M. Smith, eds., Oxford: Oxford University Press, pp. 307- 315. 

Meng, X.-L. and Rubin. D.B. (1993). Maximum likelihood estimation via the 
ECM algorithm: A general framework. Biometrika, 80, 267-278. 

Meng, X.-L. and Van Dyk, D. (1997). The EM algorithm - an old folk-song sung 
to a fast new tune (with discussion). J. Roy. Statist. Soc., Ser. B, 59, 511-567. 

Meng, X.-L. and Van Dyk, D. (1999). Seeking efficient data augmentation schemes 
via conditional and marginal augmentation. Biometrika, 86, 301-320. 

Mengersen. K.L., Robert, C.P., and Guihenneuc-Jouyaux, C. (1999). MCMC con¬ 
vergence diagnostics: a reviewww (with discussion). In Bayesian Statistics 6, 
J.M. Bernardo. J.O. Berger, A.P. Dawid, and A.F.M. Smith, eds., Oxford: 
Oxford University Press, pp. 415-440. 

Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., and Teller, E. 
(1953). Equations of state calculations by fast computing machines. J. Chem¬ 
ical Physics, 21, 1087-1091. 

Meyn, S.P. and Tweedie, R.L. (1993). Markov Chains and Stochastic Stability. 
London: Springer-Verlag. 

Mira, A. (1998). Ordering, Slicing and Splitting Monte Carlo Markov Chains. 
Unpublished Ph.D. thesis. Department of Statistics, University of Minnesota. 

Mira, A., Moller. J., and Roberts, G.O. (1999). Perfect slice samplers. Technical 
report. Department of Statistics, Aalborg University, Denmark. 

Mira, A. and Sargent. D. (2000). Strategies for speeding Markov chain Monte 


© 2000 by CRC Press LLC 



Carlo algorithms. Technical report. Faculty of Economics, University of Insub- 
ria, Varese, Italy. 

Mira, A. and Tierney, L. (1997). On the use of auxiliary variables in Markov 
chain Monte Carlo sampling. Technical report, School of Statistics, University 
of Minnesota. 

Mood, A.M., Graybill, F.A., and Boes, D.C. (1974). Introduction to the Theory 
of Statistics, 3rd ed.. New York: McGraw-Hill. 

Morris. C.N. (1983a). Parametric empirical Bayes inference: Theory and appli¬ 
cations. J. Amer. Statist. Assoc., 78 , 47-65. 

Morris, C.N. (1983b). Natural exponential families with quadratic variance func¬ 
tions: Statistical theory. Ann. Statist., IT, 515-529. 

Morris, C.N. (1988). Determining the accuracy of Bayesian empirical Bayes esti¬ 
mators in the familiar exponential families. In Statistical Decision Theory and 
Related Topics, eds. S.S. Gupta and J.O. Berger, New York: Springer-Verlag, 
327-344. 

Morris, C.N. and Normand, S.L. (1992). Hierarchical models for combining infor¬ 
mation and for meta-analyses (with discussion). In Bayesian Statistics 4, J.M. 
Bernardo, J.O. Berger, A.P. Dawid. and A.F.M. Smith, eds., Oxford: Oxford 
University Press, pp. 321-344. 

Mosteller, F. and Wallace, D.L. (1964). Inference and Disputed Authorship: The 
Federalist. Reading, MA: Addison-Wesley. 

Mukhopadhyay, S. and Gelfand, A.E. (1997). Dirichlet process mixed generalized 
linear models. J. Amer. Statist. Assoc., 92 , 633-639. 

Mugglin, A.S., Carlin, B.P., and Gelfand, A.E. (2000). Fully model based ap¬ 
proaches for spatially misaligned data. To appear. J. Amer. Statist. Assoc. 

Muller, P. (1991). A generic approach to posterior integration and Gibbs sam¬ 
pling. Technical Report 91-09, Department of Statistics, Purdue University. 

Muller, P. and Parmigiani, G. (1995). Optimal design via curve fitting of Monte 
Carlo experiments. J. Amer. Statist. Assoc., 90 , 1322-1330. 

Myers, L.E., Adams, N.H., Hughes, T.J., Williams, L.R.. and Claxton, L.D. 
(1987). An interlaboratory study of an EPA/ Ames/salmonella test protocol. 
Mutation Research, 182 , 121-133. 

Mykland, P., Tierney, L., and Yu, B. (1995). Regeneration in Markov chain sam¬ 
plers. J. Amer. Statist. Assoc., 90 , 233-241. 

Natarajan, R. and McCulloch, C.E. (1995). A note on the existence of the poste¬ 
rior distribution for a class of mixed models for binomial responses. Biometrika, 
82, 639-643. 

Natarajan, R. and Kass, R.E. (2000). Reference Bayesian methods for generalized 
linear mixed models. J. Amer. Statist. Assoc., 95, 227-237. 

National Research Council (2000). Small-Area Income and Poverty Estimates: 
Priorities for 2000 and Beyond. Panel on Estimates of Poverty for Small Ge¬ 
ographic Areas, Committee on National Statistics, C.F. Citro and G. Kalton, 
eds. Washington, D.C.: National Academy Press. 

Naylor, J.C. and Smith. A.F.M. (1982). Applications of a method for the efficient 
computation of posterior distributions. Applied Statistics, 31 , 214-235. 

Neal, R.M. (1996a). Sampling from multimodal distributions using tempered 
transitions. Statistics and Computing, 6 , 353-366. 


© 2000 by CRC Press LLC 



Neal, R.M. (1996b). Bayesian Learning for Neural Networks, New York: Springer- 
Verlag. 

Neal, R.M. (1997). Markov chain Monte Carlo methods based on 'slicing' the 
density function. Technical Report No. 9722, Dept, of Statistics, University of 
Toronto. 

Neal, R.M. (1998). Suppressing random walks in Markov Chain Monte Carlo 
using ordered overrelaxation. In Learning in Graphical Models, M.I. Jordan, 
ed., Dordrecht: Kluwer Academic Publishers, pp. 205-228. 

Newton, M.A., Guttorp, P., and Abkowitz, J.L. (1992). Bayesian inference by 
simulation in a stochastic model from hematology. Computing Science and 
Statistics, 24, 449-455. 

Newton, M.A. and Raftery, A.E. (1994). Approximate Bayesian inference by the 
weighted likelihood bootstrap (with discussion). J. Roy. Statist. Soc., Ser. B, 
56, 1-48. 

Nummelin, E. (1984). General Irreducible Markov Chains and Non-Negative Op¬ 
erators. Cambridge: Cambridge University Press. 

Odell, P.L. and Feiveson, A.H. (1966). A numerical procedure to generate a sam¬ 
ple covariance matrix. J. Amer. Statist. Assoc., 61, 198-203. 

O'Hagan, A. (1994). Kendall's Advanced Theory of Statistics Volume 2b: 
Bayesian Inference. London: Edward Arnold. 

O'Hagan, A. (1995). Fractional Bayes factors for model comparison (with discus¬ 
sion). J. Roy. Statist. Soc., Ser. B, 57, 99-138. 

O'Hagan, A. and Berger, J.O. (1988). Ranges of posterior probabilities for quasi- 
unimodal priors with specified quantiles. J. Am. Statist. Assoc., 83, 503-508. 

Pauler, D.K. (1998). The Schwarz criterion and related methods for normal linear 
models. Biometrika, 85, 13-27. 

Phillips, D.B. and Smith. A.F.M. (1996). Bayesian model comparison via jump 
diffusions. In Markov Chain Monte Carlo in Practice, eds. W.R. Gilks, S. 
Richardson, and D.J. Spiegelhalter, London: Chapman and Hall, pp. 215-239. 

Pinheiro. J.C. and Bates, D.M. (2000). Mixed-Effects Models in S and S-PLUS. 
New York: Springer-Verlag. 

Pole, A., West, M., and Harrison, P.J. (1994). Applied Bayesian Forecasting and 
Time Series Analysis. London: Chapman and Hall. 

Poison, N.G. (1996). Convergence of Markov chain Monte Carlo algorithms (with 
discussion). In Bayesian Statistics 5, eds. J.M. Bernardo, J.O. Berger, A.P. 
Dawid, and A.F.M. Smith, Oxford: Oxford University Press, pp. 297-322. 

Prentice, R.L. (1976). A generalization of the probit and logit model for dose 
response curves. Biometrics, 32, 761-768. 

Press, S.J. (1982). Applied Multivariate Analysis: Using Bayesian and Frequentist 
Methods of Inference, 2nd ed. New York: Krieger. 

Propp, J.G. and Wilson, D.B. (1996). Exact sampling with coupled Markov chains 
and applications to statistical mechanics. Random Structures and Algorithms, 
9, 223-252. 

Racine, A., Grieve, A.P., Fluher, H.. and Smith, A.F.M. (1986). Bayesian methods 
in practice: Experiences in the pharmaceutical industry. Applied Statistics, 35, 
93-120. 

Raftery, A.E. (1996). Approximate Bayes factors and accounting for model un- 


© 2000 by CRC Press LLC 



certainty in generalised linear models. Biometrika, 83, 251-266. 

Raftery, A.E. and Lewis, S. (1992). How many iterations in the Gibbs sampler? 
In Bayesian Statistics 4, J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. 
Smith, eds., Oxford: Oxford University Press, pp. 763-773. 

Raftery, A.E., Madigan, D., and Hoeting, J.A. (1997). Bayesian model averaging 
for linear regression models. J. Amer. Statist. Assoc.. 92, 179-191. 

Richardson, S. and Green, P.J. (1997). Bayesian analysis of mixtures with an 
unknown number of components. J. Roy. Statist. Soc., Sec. B, 59. 731-758 

Ripley, B.D. (1987). Stochastic Simulation. New York: Wiley. 

Ritter, C. and Tanner. M.A. (1992). Facilitating the Gibbs sampler: The Gibbs 
stopper and the griddy Gibbs sampler. J. Amer. Statist. Assoc., 87, 861-868. 

Robbins, H. (1955). An empirical Bayes approach to statistics. In Proc. 3rd Berke¬ 
ley Symp. on Math. Statist, and Proh., 1, Berkeley, CA: Univ. of California 
Press, pp. 157-164. 

Robbins, H. (1983). Some thoughts on empirical Bayes estimation. Ann. Statist., 
1,713-723. 

Robert, C.P. (1994). The Bayesian Choice: A Decision-Theoretic Motivation. 
New York: Springer-Verlag. 

Robert, C.P. (1995). Convergence control methods for Markov chain Monte Carlo 
algorithms. Statistical Science, 10, 231-253. 

Robert, C.P. and Casella, G. (1999). Monte Carlo Statistical Methods. New York: 
Springer-V erlag. 

Roberts, G.O. and Rosenthal, J.S. (1999a). Convergence of slice samper Markov 
chains. J. Roy. Statist. Soc., Ser. B, 61, 643-660. 

Roberts, G.O. and Rosenthal, J.S. (1999b). The polar slice sampler. Technical 
report. Department of Mathematics and Statistics, Lancaster University. 

Roberts, G.O. and Sahu, S.K. (1997). Updating schemes, correlation structure, 
blocking and parameterization for the Gibbs sampler. J. Roy. Statist. Soc., 
Ser. B, 59, 291-317. 

Roberts, G.O. and Smith, A.F.M. (1993). Simple conditions for the convergence 
of the Gibbs sampler and Metropolis-Hastings algorithms. Stochastic Processes 
and their Applications, 49, 207-216. 

Roberts, G.O. and Tweedie, R.L. (1996). Geometric convergence and central 
limit theorems for multidimensional Hastings and Metropolis algorithms. 
Biometrika, 83,95-110. 

Robertson, T., Wright, F.T., and Dykstra, R.L. (1988). Order Restricted Statis¬ 
tical Inference. Chichester: John Wiley & Sons. 

Rosenthal, J.S. (1993). Rates of convergence for data augmentation on finite 
sample spaces. Ann. App. Prob., 3, 819-839. 

Rosenthal, J.S. (1995a). Rates of convergence for Gibbs sampling for variance 
component models. Ann. Statist., 23, 740-761. 

Rosenthal, J.S. (1995b). Minorization conditions and convergence rates for 
Markov chain Monte Carlo. J. Amer. Statist. Assoc., 90. 558-566. 

Rosenthal, J.S. (1996). Analysis of the Gibbs sampler for a model related to 
James-Stein estimators. Statistics and Computing, 6, 269-275. 

Rubin, D.B. (1980). Using empirical Bayes techniques in the law school validity 
studies (with discussion). J. Amer. Statist. Assoc., 75, 801-827. 


© 2000 by CRC Press LLC 



Rubin. D.B. (1984). Bayesianly justifiable and relevant frequency calculations for 
the applied statistician. Ann. Statist., 12, 1151-1 172. 

Rubin. D.B. (1988). Using the SIR algorithm to simulate posterior distributions 
(with discussion). In Bayesian Statistics 3, J.M. Bernardo, M.H. DeGroot, 
D.V. Lindley, and A.F.M. Smith, eds., Oxford: Oxford University Press, pp. 
395-402. 

Samaniego, F.J. and Reneau, D.M. (1994). Toward a reconciliation of the 
Bayesian and frequentist approaches to point estimation. J. Amer. Statist. 
Assoc., 89, 947-957. 

Sargent, D.J. (1995). A general framework for random effects survival analysis 
in the Cox proportional hazards setting. Research Report 95-004, Division of 
Biostatistics, University of Minnesota. 

Sargent, D.J. and Carlin, B.P. (1996). Robust Bayesian design and analysis of 
clinical trials via prior partitioning (with discussion). In Bayesian Robustness, 
IMS Lecture Notes - Monograph Series, 29, eds. J.O. Berger et al., Hayward, 
CA: Institute of Mathematical Statistics, pp. 175-193. 

Sargent, D.J., Hodges, J.S., and Carlin, B.P. (2000). Structured Markov chain 
Monte Carlo. To appear. Journal of Computational and Graphical Statistics. 

Schervish, M.J. and Carlin, B.P. (1992). On the convergence of successive sub¬ 
stitution sampling. J. Computational and Graphical Statistics, 1, 111-127. 

Schluchter, M.D. (1988). "5V: Unbalanced repeated measures models with struc¬ 
tured covariance matrices," in BMDP Statistical Software Manual, Vol. 2, W.J. 
Dixon, ed., pp. 1081-1114. 

Schmeiser, B. and Chen. M.H. (1991). On random-direction Monte Carlo sam¬ 
pling for evaluating multidimensional integrals. Technical Report #91-39, De¬ 
partment of Statistics, Purdue University. 

Schmittlein. D.C. (1989). Surprising inferences from unsurprising observations: 
Do conditional expectations really regress to the mean? The American Statis¬ 
tician ,43, 176-183. 

Schumacher, M., Olschewski, M., and Schmoor, C. (1987). The impact of hetero¬ 
geneity on the comparison of survival times. Statist, in Med., 6,773-784. 

Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist., 6, 461- 
464. 

Searle, S.R., Casella, G., and McCulloch, C.E. (1992). Variance Components. 
New York: John Wiley & Sons. 

Shen. W. and Louis, T.A. (1998). Triple-goal estimates in two-stage, hierarchical 
models. J. Roy. Statist. Soc., Ser. B, 60, 455-471. 

Shen. W. and Louis, T.A. (1999). Empirical Bayes estimation via the smoothing 
by roughening approach. J. Comp. Graph. Statist., 8, 800-823. 

Shen. W. and Louis, T.A. (2000). Triple-goal estimates for disease mapping. To 
appear. Statistics in Medicine . 

Shih, J. (1995). Sample size calculation for complex clinical trials with survival 
endpoints. Controlled Clinical Trials , 16. 395-407. 

Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. 
London: Chapman and Hall. 

Simar, L. (1976). Maximum likelihood estimation of a compound Poisson process. 
Ann. Statist., 4, 1200-1209. 


© 2000 by CRC Press LLC 



Sinha, D. (1993). Semiparametric Bayesian analysis of multiple event time data. 
J. Amer. Statist. Assoc.. 88, 979-983. 

Smith, A.F.M. and Gelfand, A.E. (1992). Bayesian statistics without tears: A 
sampling-resampling perspective. The American Statistician , 46, 84-88. 

Smith, A.F.M. and Roberts, G.O. (1993). Bayesian computation via the Gibbs 
sampler and related Markov chain Monte Carlo methods. J. Roy. Statist. Soc., 
Ser. B, 55, 1-24. 

Snedecor, G.W. and Cochran, W.G. (1980). Statistical Methods, seventh edition. 
Ames, Iowa: The Iowa State University Press. 

Spiegelhalter, D.J. (2000). Bayesian methods for cluster randomised trials with 
continuous responses. Technical report. Medical Research Council Biostatistics 
Unit, Institute of Public Health, Cambridge University. 

Spiegelhalter, D.J., Best, N., and Carlin, B.P. (1998). Bayesian deviance, the effec¬ 
tive number of parameters, and the comparison of arbitrarily complex models. 
Research Report 98-009, Division of Biostatistics, University of Minnesota. 

Spiegelhalter, D.J., Dawid, A.P., Lauritzen, S.L., and Cowell. R.G. (1993). 
Bayesian analysis in expert systems (with discussion). Statistical Science , 8. 
219-283. 

Spiegelhalter, D.J., Freedman, L.S., and Parmar, M.K.B. (1994). Bayesian ap¬ 
proaches to randomised trials (with discussion). J. Roy. Statist. Soc., Ser. A, 
157, 357-416. 

Spiegelhalter. D.J., Myles, J.P., Jones, D.R., and Abrams, K.R. (2000). Bayesian 
methods in health technology assessment. To appear. Health Technology As¬ 
sessment. 

Spiegelhalter, D.J., Thomas, A.. Best, N„ and Gilks, W.R. (1995a). BUGS: 
Bayesian inference using Gibbs sampling. Version 0.50. Technical report. Med¬ 
ical Research Council Biostatistics Unit, Institute of Public Health, Cambridge 
University. 

Spiegelhalter, D.J., Thomas, A., Best, N„ and Gilks, W.R. (1995b). BUGS ex¬ 
amples, Version 0.50. Technical report, Medical Research Council Biostatistics 
Unit, Institute of Public Health, Cambridge University. 

Stein, C. (1955). Inadmissibility of the usual estimator for the mean of a multi¬ 
variate normal distribution. In Proc. Third Berkeley Symp. on Math. Statist, 
and Prob., 1, Berkeley, CA: Univ. of California Press, pp. 197-206. 

Stein, C. (1981). Estimation of the parameters of a multivariate normal distribu¬ 
tion: I. Estimation of the means. Annals of Statistics, 9, 1135-1151. 

Stephens, D.A. and Smith, A.F.M. (1992). Sampling-resampling techniques for 
the computation of posterior densities in normal means problems. Test, 1. 
1-18. 

Stern, H.S. and Cressie, N. (1999). Inference for extremes in disease mapping. 
In Disease Mapping and Risk Assessment for Public Health, A. Lawson, A. 
Biggeri, D. Bohning. E. Lesaffre. J.-F. Viel. and R. Bertollini, eds.. New York: 
Wiley, pp. 63-84. 

Staffer, D.S., Scher, M.S., Richardson, G.A., Day, N.L., and Coble, P.A. (1988). A 
Walsh-Fourier analysis of the effects of moderate maternal alcohol consumption 
on neonatal sleep-state cycling. J. Amer. Statist. Assoc., 83, 954-963. 

Swendsen, R.H. and Wang, J.-S. (1987). Nonuniversal critical dynamics in Monte 


© 2000 by CRC Press LLC 



Carlo simulations. Phys. Rev. Letters, 58, 86-88. 

Tamura, R.N. and Young, S.S. (1986). The incorporation of historical control 
information in tests of proportions: Simulation study of Tarone's procedure. 
Biometrics, 42, 343-349. 

Tanner, M.A. (1993). Tools for Statistical Inference: Methods for the Explo¬ 
ration of Posterior Distributions and Likelihood Functions, 2nd ed. New York: 
Springer-Verlag. 

Tanner, M.A. and Wong, W.H. (1987). The calculation of posterior distributions 
by data augmentation (with discussion). J. Amer. Statist. Assoc., 82, 528-550. 

Tarone, R.E. (1982). The use of historical control information in testing for a 
trend in proportions. Biometrics, 38, 215-220. 

Ten Have, T.R. and Localio, A.R. (1999). Empirical Bayes estimation of random 
effects parameters in mixed effects logistic regression models. Biometrics, 55, 
1022-1029. 

Thisted, R.A. (1988). Elements of Statistical Computing. London: Chapman and 
Hall. 

Tierney, L. (1990). LISP-STAT - An Object-Oriented Environment for Statistical 
Computing and Dynamic Graphics. New York: John Wiley & Sons. 

Tierney, L. (1994). Markov chains for exploring posterior distributions (with dis¬ 
cussion). Ann. Statist., 22, 1701-1762. 

Tierney, L. and Kadane, J.B. (1986). Accurate approximations for posterior mo¬ 
ments and marginal densities. J. Amer. Statist. Assoc., 81, 82-86. 

Tierney, L., Kass, R.E., and Kadane. J.B. (1989). Fully exponential Laplace ap¬ 
proximations to expectations and variances of nonpositive functions. J. Amer. 
Statist. Assoc., 84, 710-716. 

Titterington, D.M.. Makov, U.E. and Smith, A.F.M. (1985). Statistical Analysis 
of Finite Mixture Distributions. Chichester: John Wiley & Sons. 

Tolbert. P.. Mulholland, J., Macintosh. D.. Xu, F., Daniels. D.. Devine, O., Car¬ 
lin, B.P.. Butler, A., Nordenberg, D., and White, M. (2000). Air quality and 
pediatric emergency room visits for asthma in Atlanta. Amer. J. Epidemiology, 
151:8, 798-810. 

Tomberlin, T.J. (1988). Predicting accident frequencies for drivers classified by 
two factors . J. Amer. Statist. Assoc., 83, 309-321. 

Treloar, M.A. (1974). Effects of Puromycin on galactosyltransferase of golgi mem¬ 
branes. Unpublished Master's thesis. University of Toronto. 

Troughton, P.T. and Godsill, S.J. (1997). A reversible jump sampler for au¬ 
toregressive time series, employing full conditionals to achieve efficient model 
space moves. Technical Report CUED/F-INFENG/TR.304. Engineering De¬ 
partment, Cambridge University. 

Tsiatis, A.A. (1990). Estimating regression parameters using linear rank tests for 
censored data . Annals of Statistics, 18, 354-372. 

Tsutakawa, R.K.. Shoop, G.L., and Marienfeld, C.J. (1985). Empirical Bayes 
estimation of cancer mortality rates. Statist, in Med., 4, 201-212. 

Tukey, J.W. (1974). Named and faceless values: An initial exploration in memory 
of Prasanta C. Mahalanobis. Sankhya, Series A, 36, 125-176. 

van Houwelingen, J.C. (1977). Monotonizing empirical Bayes estimators for a 
class of discrete distributions with monotone likelihood ratio. Statist. Neerl., 


© 2000 by CRC Press LLC 



31,95-104. 

van Houwelingen, J.C. and Thorogood. J. (1995). Construction, validation and 
updating of a prognostic model for kidney graft survival. Statistics in Medicine, 
14, 1999-2008. 

van Ryzin. J. and Susarla, J.V. (1977). On the empirical Bayes approach to 
multiple decision problems. Ann. Statist., 5, 172-181. 

Vaupel, J.W., Manton, K.G., and Stallard. E. (1979). The impact of heterogeneity 
in individual frailty on the dynamics of mortality. Demography, 16, 439-454. 

Vaupel, J.W. and Yashin, A.I. (1985). Heterogeneity's ruses: Some surprising 
effects of selection on population dynamics. The American Statistician, 39, 
176-185. 

Verdinelli, I. (1992). Advances in Bayesian experimental design (with discussion). 
In Bayesian Statistics 4, J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. 
Smith, eds., Oxford: Oxford University Press, pp. 467-481. 

Von Mises, R. (1942). On the correct use of Bayes' formula. Ann. Math. Statist., 
13, 156-165. 

Wagener. D.K. and Williams, D.R. (1993). Equity in environmental health: Data 
collection and interpretation issues. Toxicology and Industrial Health, 9, 775- 
795. 

Wakefield, J. (1996). The Bayesian analysis of population pharmacokinetic mod¬ 
els. J. Amer. Statist. Assoc., 91, 62-75. 

Wakefield, J.C. (1998). Discussion of "Some Algebra and Geometry for Hierarchi¬ 
cal Models, Applied to Diagnostics," by J.S. Hodges. J. Roy. Stat. Soc., Ser. 

B, 60, 523-525. 

Wakefield, J.C., Gelfand, A.E., and Smith. A.F.M. (1991). Efficient generation of 
random variates via the ratio-of-uniforms method. Statistics and Computing, 
1, 129-133. 

Wakefield, J.C., Smith, A.F.M., Racine-Poon, A., and Gelfand, A.E. (1994). 
Bayesian analysis of linear and nonlinear population models using the Gibbs 
sampler. J. Roy. Statist. Soc., Ser. C (Applied Statistics), 43, 201-221. 

Walker, S. (1995). Generating random variates from D-distributions via substi¬ 
tution sampling. Statistics in Computing, 5, 311-315. 

Waller, L.A., Carlin, B.P., Xia, H., and Gelfand, A.E. (1997). Hierarchical spatio- 
temporal mapping of disease rates. J. Amer. Statist. Assoc., 92, 607-617. 

Waller. L.A., Turnbull, B.W., Clark, L.C., and Nasca, P. (1994). Spatial pattern 
analyses to detect rare disease clusters. In Case Studies in Biometry, N. Lange, 
L. Ryan, L. Billard, D. Brillinger, L. Conquest, and J. Greenhouse, eds.. New 
York: John Wiley & Sons, pp. 3-23. 

Walter, S.D. and Birnie, S.E. (1991). Mapping mortality and morbidity patterns: 
An international comparison. International J. Epidemiology, 20, 678-689. 

Ware, J.H. (1989). Investigating therapies of potentially great benefit: ECMO 
(with discussion). Statistical Science, 4. 298-340. 

Wasserman, L. and Kadane, J.B. (1992). Computing bounds on expectations. J. 
Amer. Statist. Assoc., 87, 516-522. 

Waternaux, C., Laird, N.M., and Ware, J.A. (1989). Methods for analysis of 
longitudinal data: Blood lead concentrations and cognitive development. J. 
Amer. Statist. Assoc., 84, 33-41. 


© 2000 by CRC Press LLC 



Wei, G.C.G. and Tanner, M.A. (1990). A Monte Carlo implementation of the 
EM algorithm and the poor man's data augmentation algorithms. J. Amer. 
Statist. Assoc., 85, 699-704. 

West, M. (1992). Modelling with mixtures (with discussion). In Bayesian Statis¬ 
tics 4, J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith, eds., Ox¬ 
ford: Oxford University Press, pp. 503-524. 

West, M., MUller. P., and Escobar, M.D. (1994). Hierarchical priors and mixture 
models, with application in regression and density estimation. In A spects of 
Uncertainty: A Tribute to D. V. Lindley, eds. A.F.M. Smith and P.R. Freeman, 
London: Wiley. 

West, M. and Harrison, P.J. (1989). Bayesian Forecasting and Dynamic Models. 
New York: Springer-Verlag. 

West, M., Harrison, P.J., and Pole, A. (1987). BATS: Bayesian Analysis of Time 
Series. The Professional Statistician, 6, 43-46. 

Williams, E. (1959). Regression Analysis. New York: Wiley. 

Whittemore, A.S. (1989). Errors-in-variables regression using Stein estimates. 
The American Statistician. 43 , 226-228. 

Wolfinger, R. (1993). Laplace's approximation for nonlinear mixed models. 
Biometrika, 80 , 791-795. 

Wolfinger, R. and Kass, R.E. (2000). Non-conjugate Bayesian analysis of variance 
component models. To appear. Biometrics. 

Wolfinger, R. and O'Connell, M. (1993). Generalized linear mixed models: A 
pseudo-likelihood approach. Journal of Statistical Computation and Simula¬ 
tion, 48 , 233-243. 

Wolfinger. R.D. and Rosner, G.L. (1996). Bayesian and frequendst analyses of 
an in vivo experiment in tumor hemodynamics. In Bayesian Biostatistics, eds. 
D. Berry and D. Stangl, New York: Marcel Dekker, pp. 389-410. 

Wolfson. L.J. (1995). Elicitation of priors and utilities for Bayesian analysis. 
Unpublished Ph.D. disssertation. Department of Statistics, Carnegie Mellon 
University. 

Wolpert, R.L. and Ickstadt. K. (1998). Poisson/gamma random field models for 
spatial statistics. Biometrika, 85,251-267. 

Wooff, D. (1992). [B/D] works. In Bayesian Statistics 4, J.M. Bernardo, J.O. 
Berger, A.P. Dawid, and A.F.M. Smith, eds., Oxford: Oxford University Press, 
pp. 851-859. 

Xia, H., Carlin, B.P., and Waller, L.A. (1997). Hierarchical models for mapping 
Ohio lung cancer rates. Environmetrics, 8, 107-120. 

Ye, J. (1998). On measuring and correcting the effects of data mining and model 
selection. Journal of the American Statistical Association, 93 , 120-131. 

Zaslavsky, A.M. (1993). Combining census, dual-system, and evaluation study 
data to estimate population shares. J. Amer. Statist. Assoc., 88, 1092-1105. 

Zeger, S.L. and Karim, M.R. (1991). Generalized linear models with random 
effects; a Gibbs sampling approach. J. Amer. Statist. Assoc.. 86. 79-86. 

Zhou, X.-H., Perkins, A.J., and Hui. S.L. (1999). Comparisons of software pack¬ 
ages for generalized linear multilevel models. The American Statistician, 53, 
282-290. 


© 2000 by CRC Press LLC 



